Kubeflow on a Bare-Metal GPU Cluster from Scratch

Vadim Markovtsev
Head of Analytics and ML at Athenian
Founder at Fragile Tech

INTRODUCTION

About me

What is Kubeflow

HARDWARE

Scheme

Linux

IPMI

Configuration management

KUBERNETES

k0s

k0s commands on controller

sudo k0s install controller
sudo systemctl start k0scontroller.service
sudo k0s token create --role=worker
sudo snap install kubectl --classic

k0s commands on controller

# option 1: reuse the bundled admin kubeconfig
sudo cp /var/lib/k0s/pki/admin.conf .
sudo chown $(whoami) admin.conf
export KUBECONFIG=$(pwd)/admin.conf
export clusterUser=$(whoami)
kubectl create clusterrolebinding $clusterUser-admin-binding \
    --clusterrole=admin --user=$clusterUser
# option 2: generate a dedicated kubeconfig for the current user
mkdir -p ~/.kube && export KUBECONFIG=
sudo k0s kubeconfig create --groups "system:masters" \
    $clusterUser > ~/.kube/config

k0s commands on worker

sudo k0s install worker --token-file /path/to/token/file
sudo systemctl start k0sworker.service
# debug: talk to k0s's containerd directly
sudo ctr --address /run/k0s/containerd.sock -n k8s.io containers list

Problems

NVIDIA GPUs

metadata:
  ...
spec:
  ...
  containers:
  - ...
    resources:
      limits:
        nvidia.com/gpu: 2
# ls /dev/nvidia[0-9]
/dev/nvidia0  /dev/nvidia1

NVIDIA GPUs

  1. Install NVIDIA driver on each worker
  2. Add nvidia-smi to rc.local 🤦‍♂️
  3. Install nvidia-container-runtime
  4. Patch containerd configuration on the worker nodes for /run/k0s/containerd*
  5. Apply daemonset from Google to support nvidia.com/gpu resource
  6. kubectl label nodes --all cloud.google.com/gke-accelerator=true
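
Step 4 can look roughly like the fragment below. The file path and the plugin section names vary across containerd versions, so treat this as a sketch rather than a drop-in file:

```toml
# Hypothetical containerd configuration patch for the worker nodes.
# It registers nvidia-container-runtime and makes it the default,
# so that pods requesting nvidia.com/gpu can reach the devices.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```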

NVIDIA GPUs

$ kubectl get daemonset -n kube-system nvidia-gpu-device-plugin
NAME                       DESIRED   CURRENT   READY   UP-TO-DATE
nvidia-gpu-device-plugin   3         3         3       3

Need GPU overcommit for the lab 🤔

DISTRIBUTED FILE SYSTEM

No shared files => no ML

Options

Cross NFS
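
The idea behind cross NFS is that every node exports a local directory and mounts the exports of every other node. A minimal sketch, with hypothetical paths, hostnames, and subnet:

```
# /etc/exports on every node (hypothetical path and subnet)
/storage  10.0.0.0/24(rw,sync,no_subtree_check)

# /etc/fstab entry on every *other* node (hypothetical hostname)
node1:/storage  /mnt/node1  nfs  defaults  0  0
```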

File flow

Pros and cons

Integration with Kubernetes

$ kubectl patch storageclass local-path -p '{"metadata":
{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
$ kubectl get storageclass
NAME                   PROVISIONER             RECLAIMPOLICY   AGE
local-path (default)   rancher.io/local-path   Delete          13d
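
With local-path marked as the default StorageClass, a PVC that omits storageClassName is provisioned automatically. A minimal sketch (the claim name is hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim        # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```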

KUBEFLOW

Ways to install Kubeflow

  1. Packaged distribution
    • e.g., for a specific cloud provider: AKS, GKE, etc.
  2. Manifests, aka the "advanced" or "on-premises" install
    • Istio to control traffic
    • Dex to provide OpenID authentication (e.g., GitHub, Google, Okta, Auth0)
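
For reference, a Dex connector entry for GitHub looks roughly like this; the client ID, secret, and host are placeholders you obtain by registering an OAuth app:

```yaml
connectors:
- type: github
  id: github
  name: GitHub
  config:
    # placeholders; register a GitHub OAuth app to obtain these
    clientID: $GITHUB_CLIENT_ID
    clientSecret: $GITHUB_CLIENT_SECRET
    redirectURI: https://<kubeflow-host>/dex/callback
```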

Installation

export BASE_DIR=/opt
export KF_NAME=my-kubeflow
export KF_DIR=${BASE_DIR}/${KF_NAME}
# Download kfctl from https://github.com/kubeflow/kfctl/releases
mkdir -p ${KF_DIR} && cd ${KF_DIR}
kfctl apply -V -f kfctl_istio_dex.yaml

Problems

SUMMARY

Medium:
Deploying Kubeflow to a Bare-Metal GPU Cluster from Scratch

Thank you

bit.ly/3qTtovQ