Duration
35 hours (usually 5 days including breaks)
Requirements
- Familiarity with Python syntax
- Experience with TensorFlow, PyTorch, or another machine learning framework
- An AWS account with the necessary resources
Audience
- Developers
- Data scientists
Overview
Kubeflow is a toolkit for making Machine Learning (ML) on Kubernetes easy, portable, and scalable. AWS EKS (Elastic Kubernetes Service) is a managed service from Amazon for running Kubernetes on AWS.
This instructor-led, live training (online or onsite) is aimed at developers and data scientists who wish to build, deploy, and manage machine learning workflows on Kubernetes.
By the end of this training, participants will be able to:
- Install and configure Kubeflow on premises and in the cloud using AWS EKS (Elastic Kubernetes Service).
- Build, deploy, and manage ML workflows based on Docker containers and Kubernetes.
- Run entire machine learning pipelines on diverse architectures and cloud environments.
- Use Kubeflow to spawn and manage Jupyter notebooks.
- Build ML training, hyperparameter tuning, and serving workloads across multiple platforms.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request customized training for this course, please contact us.
Course Outline
Introduction
- Introduction to Kubernetes
- Overview of Kubeflow Features and Architecture
- Kubeflow on AWS vs. on-premises vs. other public cloud providers
Setting up a Cluster using AWS EKS
Setting up an On-Premises Cluster using MicroK8s
Deploying Kubernetes using a GitOps Approach
Data Storage Approaches
Creating a Kubeflow Pipeline
Triggering a Pipeline
Defining Output Artifacts
Storing Metadata for Datasets and Models
Hyperparameter Tuning with TensorFlow
Visualizing and Analyzing the Results
Multi-GPU Training
Creating an Inference Server for Deploying ML Models
Working with JupyterHub
Networking and Load Balancing
Auto Scaling a Kubernetes Cluster
Troubleshooting
Summary and Conclusion