Vadim Markovtsev, source{d}.
tf.distribute.Strategy
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")
model.fit(dataset)
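For completeness, a minimal runnable sketch of the same MirroredStrategy example; the import and the toy dataset are additions here, not part of the original snippet:

import tensorflow as tf

# Synchronous data parallelism across all local GPUs (falls back to CPU).
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Variables created inside the scope are mirrored onto every replica.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")

# Toy dataset; each global batch is split between the replicas.
dataset = tf.data.Dataset.from_tensor_slices(
    ([[1.], [2.], [3.], [4.]], [[2.], [4.], [6.], [8.]])).batch(4)
model.fit(dataset, epochs=10)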
Use tf.keras, not the standalone import keras.
MirroredStrategy (local)
CentralStorageStrategy (local)
TPUStrategy (local)
MultiWorkerMirroredStrategy (network)
tf.data.Dataset
Sharding of the tf.data.Dataset between workers
training.py:
import tensorflow as tf
strategy = \
tf.distribute.experimental.MultiWorkerMirroredStrategy()
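The rest of training.py can follow the same Keras pattern as above; this sketch reuses the toy model and dataset, which are illustrative assumptions rather than the talk's code:

with strategy.scope():
    # Model and optimizer must be created inside the multi-worker scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")

# Every worker builds the full dataset; it is sharded between workers (see above).
dataset = tf.data.Dataset.from_tensor_slices(
    ([[1.], [2.], [3.], [4.]], [[2.], [4.], [6.], [8.]])).batch(4)
model.fit(dataset, epochs=10)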
{
    "cluster": {
        "worker": ["host1:port", "host2:port"]
    },
    "task": {
        "type": "worker", "index": 0
    }
}
# on host1 (task index 0):
export TF_CONFIG=...
python3 training.py
# on host2 (task index 1):
export TF_CONFIG=...
python3 training.py
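Equivalently, TF_CONFIG can be assembled in Python before the strategy object is created; the host names, the port placeholder, and the worker index are taken from the JSON above:

import json
import os

# Must be set before MultiWorkerMirroredStrategy() is constructed.
# host2 runs the same script with "index": 1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port"]
    },
    "task": {"type": "worker", "index": 0}
})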
2019-10-18 19:22:33.199624: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> localhost:5555, 1 -> 10.2.2.66:5555}
2019-10-18 19:22:33.206039: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:5555
tf.keras 2.0
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each Horovod process to one local GPU and avoid grabbing all GPU memory.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.experimental.set_visible_devices(
    [gpus[hvd.local_rank()]], "GPU")
tf.keras 2.0
# Scale the learning rate with the number of Horovod processes.
learning_rate *= hvd.size()
opt = hvd.DistributedOptimizer(opt)
model.compile(..., experimental_run_tf_function=False)
callbacks.append(
    hvd.callbacks.BroadcastGlobalVariablesCallback(0))
if hvd.rank() == 0:
    callbacks.append("""checkpoint""")
tf.keras 2.0
horovodrun -np 4 -H localhost:4 python3 entry.py
Kernel boot parameters intel_iommu=off or intel_iommu=soft (the IOMMU can interfere with GPU peer-to-peer transfers).
I tensorflow/.../gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/.../gpu_device.cc:1165]      0 1 2 3
I tensorflow/.../gpu_device.cc:1178] 0:   N Y N N
I tensorflow/.../gpu_device.cc:1178] 1:   Y N N N
I tensorflow/.../gpu_device.cc:1178] 2:   N N N Y
I tensorflow/.../gpu_device.cc:1178] 3:   N N Y N
p2pBandwidthLatencyTest
github.com/NVIDIA/cuda-samples
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.25   0.99  12.84  13.39
     1   1.01   1.37  11.21  10.36
     2  12.39  12.33   1.28   1.07
     3  10.39  10.86   1.04   1.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0       1       2       3
     0  354.15   10.28   11.18   11.16
     1   10.27  152.29   11.16   11.11
     2   11.15   10.85  355.44   10.28
     3   11.15   10.94   10.28  354.15
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 43%   63C    P2   101W / 250W | 10909MiB / 11178MiB  |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   60C    P2   101W / 250W | 10913MiB / 11178MiB  |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 41%   60C    P2    93W / 250W | 10909MiB / 11178MiB  |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 43%   62C    P2    92W / 250W | 10913MiB / 11178MiB  |     69%      Default |
+-------------------------------+----------------------+----------------------+
dataset = tf.data.Dataset \
    .from_whatever() \
    .map(augment, num_parallel_calls=N) \
    .batch(batch_size, drop_remainder=True)
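A concrete, runnable variant of that pipeline; the stand-in tensors, the flip-only augment function, and the final prefetch are illustrative assumptions:

import tensorflow as tf

def augment(image, label):
    # Placeholder augmentation: a random horizontal flip.
    return tf.image.random_flip_left_right(image), label

images = tf.zeros([128, 32, 32, 3])       # stand-in image tensor
labels = tf.zeros([128], dtype=tf.int32)  # stand-in labels
batch_size = 32

dataset = tf.data.Dataset \
    .from_tensor_slices((images, labels)) \
    .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .batch(batch_size, drop_remainder=True) \
    .prefetch(tf.data.experimental.AUTOTUNE)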
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21 root 20 0 0.122t 0.023t 945388 R 1333 9.3 63880:08 python3