Kubeflow on a Bare-Metal GPU Cluster from Scratch

Vadim Markovtsev
Head of Analytics and ML at Athenian
Founder at Fragile Tech

INTRODUCTION

About me

What is Kubeflow

HARDWARE

Scheme

Linux

IPMI

Configuration management

KUBERNETES

k0s

k0s commands on controller

sudo k0s install controller
sudo systemctl start k0scontroller.service
sudo k0s token create --role=worker
sudo snap install kubectl --classic

k0s commands on controller

# option 1: reuse the bundled admin kubeconfig
sudo cp /var/lib/k0s/pki/admin.conf .
sudo chown $(whoami) admin.conf
export KUBECONFIG=$(pwd)/admin.conf
export clusterUser=$(whoami)
kubectl create clusterrolebinding $clusterUser-admin-binding \
    --clusterrole=admin --user=$clusterUser
# option 2: generate a dedicated kubeconfig for the current user
mkdir -p ~/.kube && export KUBECONFIG=
sudo k0s kubeconfig create --groups "system:masters" \
    $clusterUser > ~/.kube/config

k0s commands on worker

sudo k0s install worker --token-file /path/to/token/file
sudo systemctl start k0sworker.service
# debug: talk to k0s's containerd directly
sudo ctr --address /run/k0s/containerd.sock -n k8s.io containers list

Problems

NVIDIA GPUs

metadata:
  ...
spec:
  ...
  containers:
  - ...
    resources:
      limits:
        nvidia.com/gpu: 2
# ls /dev/nvidia[0-9]
/dev/nvidia0  /dev/nvidia1

NVIDIA GPUs

  1. Install NVIDIA driver on each worker
  2. Add nvidia-smi to rc.local 🤦‍♂️
  3. Install nvidia-container-runtime
  4. Patch containerd configuration on the worker nodes for /run/k0s/containerd*
  5. Apply daemonset from Google to support nvidia.com/gpu resource
  6. kubectl label nodes --all cloud.google.com/gke-accelerator=true
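
Step 4 can look roughly like the fragment below. The file path and the plugin section names vary across containerd versions, so treat this as a sketch rather than a drop-in file:

```toml
# Hypothetical containerd configuration patch for the worker nodes.
# It registers nvidia-container-runtime and makes it the default,
# so that pods requesting nvidia.com/gpu can reach the devices.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```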

NVIDIA GPUs

$ kubectl get daemonset -n kube-system nvidia-gpu-device-plugin
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE
nvidia-gpu-device-plugin   3         3         3       3

Need GPU overcommit for the lab 🤔

DISTRIBUTED FILE SYSTEM

No shared files => no ML

Options

Cross NFS
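
The idea behind cross NFS is that every node exports a local directory and mounts the exports of every other node. A minimal sketch, with hypothetical paths, hostnames, and subnet:

```
# /etc/exports on every node (hypothetical path and subnet)
/storage  10.0.0.0/24(rw,sync,no_subtree_check)

# /etc/fstab entry on every *other* node (hypothetical hostname)
node1:/storage  /mnt/node1  nfs  defaults  0  0
```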

File flow

Pros and cons

Integration with Kubernetes

$ kubectl patch storageclass local-path -p '{"metadata":
{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
$ kubectl get storageclass
NAME                   PROVISIONER             RECLAIMPOLICY   AGE
local-path (default)   rancher.io/local-path   Delete          13d
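
With local-path marked as the default StorageClass, a PVC that omits storageClassName is provisioned automatically. A minimal sketch (the claim name is hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim        # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```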

KUBEFLOW

Ways to install Kubeflow

  1. Packaged distribution
    • e.g., for a specific cloud provider: AKS, GKE, etc.
  2. Manifests, aka the "advanced" or "on-premises" install
    • Istio to control traffic
    • Dex to provide OpenID authentication (e.g., GitHub, Google, Okta, Auth0)
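
For reference, a Dex connector entry for GitHub looks roughly like this; the client ID, secret, and host are placeholders you obtain by registering an OAuth app:

```yaml
connectors:
- type: github
  id: github
  name: GitHub
  config:
    # placeholders; register a GitHub OAuth app to obtain these
    clientID: $GITHUB_CLIENT_ID
    clientSecret: $GITHUB_CLIENT_SECRET
    redirectURI: https://<kubeflow-host>/dex/callback
```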

Installation

export BASE_DIR=/opt
export KF_NAME=my-kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
# Download kfctl from https://github.com/kubeflow/kfctl/releases
mkdir -p ${KF_DIR} && cd ${KF_DIR}
kfctl apply -V -f kfctl_istio_dex.yaml

Problems

SUMMARY

Medium:
Deploying Kubeflow to a Bare-Metal GPU Cluster from Scratch

Thank you

bit.ly/3qTtovQ