Distributed training of TensorFlow models

Vadim Markovtsev, source{d}.


Plan

  1. Introduction to distributed deep learning
  2. TensorFlow Distribution Strategies
  3. Horovod
  4. Problems and limitations

Introduction

via OpenAI

Distributed deep learning

Workflow partitioning + and −

Model parallelism + and −

Data parallelism + and −

Data parallelism

Data-parallel training approaches

Asynchronous (individual) updates

Data-parallel training approaches

Synchronous updates
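
With synchronous updates every replica processes its own shard of the global batch, the gradients are averaged across replicas, and every replica applies the identical update. A single-process sketch of one step; grad_fn, shards, lr and the flat weight vector are illustrative names:

import numpy as np

def sync_step(weights, shards, grad_fn, lr):
    # each worker computes gradients on its own shard of the global batch
    grads = [grad_fn(weights, x, y) for x, y in shards]
    # gradients are averaged across workers (this is what allreduce gives us)
    mean_grad = np.mean(grads, axis=0)
    # every worker applies the same update, so the replicas never diverge
    return weights - lr * mean_grad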

Distribution approaches

Parameter server + and −

Collective ops; 1x + and −

Collective ops; Nx + and −

Collective ops

nvidia/nccl
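
Collective ops boil down to allreduce: every worker contributes a tensor and every worker ends up with the same reduced (summed or averaged) result. A naive sketch of the semantics only; NCCL implements bandwidth-optimal ring and tree variants of this on GPUs:

import numpy as np

def allreduce(per_worker_tensors):
    # reduce: sum the contributions of all workers
    total = np.sum(per_worker_tensors, axis=0)
    # broadcast: every worker receives the same result
    return [total.copy() for _ in per_worker_tensors]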

Introduction summary

The most promising approach to distributed training: data parallelism with synchronous updates over collective (allreduce) ops.

Distribution Strategies

TensorFlow 2.0

🎉 September 30th, 2019 🎉

TensorFlow 2.0

tf.distribute.Strategy

Example

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")
model.fit(dataset)  # dataset is any tf.data.Dataset yielding (x, y) batches

Only a couple of extra lines on top of plain Keras code
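
The dataset passed to fit() carries the global batch: MirroredStrategy splits each batch across the replicas, so the batch size (and usually the learning rate) is scaled by the replica count. A sketch; the per-replica batch size is illustrative:

per_replica_batch = 64                      # illustrative
global_batch = per_replica_batch * \
    mirrored_strategy.num_replicas_in_sync  # e.g. 256 with 4 GPUs
dataset = dataset.batch(global_batch)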

Not the same as
keras.utils.multi_gpu_model

Use tf.keras, not the standalone import keras

Supported strategies

MirroredStrategy

MultiWorkerMirroredStrategy

training.py

import tensorflow as tf

# cluster membership is read from the TF_CONFIG environment variable
strategy = \
    tf.distribute.experimental.MultiWorkerMirroredStrategy()
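
The rest of training.py looks like single-worker Keras code: build and compile the model inside the strategy scope, then call fit(). A minimal sketch, reusing the toy model from the MirroredStrategy example and an arbitrary dataset:

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")
model.fit(dataset)  # every worker runs this same script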

TF_CONFIG

{
    "cluster": {
        "worker": ["host1:port", "host2:port"]
    },
    "task": {
        "type": "worker", "index": 0
    }
}
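
Only task.index differs between the hosts. A sketch of generating TF_CONFIG from Python; the host names and the port match the log output below:

import json, os

workers = ["host1:5555", "host2:5555"]
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": workers},
    "task": {"type": "worker", "index": 0},  # 0 on host1, 1 on host2
})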

host1

export TF_CONFIG=...
python3 training.py

host2

export TF_CONFIG=...
python3 training.py
2019-10-18 19:22:33.199624: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache
        for job worker -> {0 -> localhost:5555, 1 -> 10.2.2.66:5555}
2019-10-18 19:22:33.206039: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:5555

Horovod

In a nutshell

Concepts

bit.ly/2PWJhzs

Tensor Fusion

Batching of small allreduce ops.
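
Small gradient tensors are copied into one fusion buffer and reduced together instead of paying per-op latency. The buffer size can be tuned through Horovod's HOROVOD_FUSION_THRESHOLD environment variable (bytes); a sketch, assuming it is set before Horovod starts:

import os
# must be set before hvd.init() (or exported before horovodrun)
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(128 * 1024 * 1024)  # 128 MB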

Usage with tf.keras 2.0

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()
# pin each process to a single GPU, selected by its local rank
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.experimental.set_visible_devices(
    [gpus[hvd.local_rank()]], "GPU")

Usage with tf.keras 2.0

learning_rate *= hvd.size()  # scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(opt)  # wraps the optimizer with allreduce
model.compile(..., experimental_run_tf_function=False)
callbacks.append(
    hvd.callbacks.BroadcastGlobalVariablesCallback(0))
if hvd.rank() == 0:
    # save checkpoints on the first worker only (path is illustrative)
    callbacks.append(
        tf.keras.callbacks.ModelCheckpoint("checkpoint-{epoch}.h5"))

Usage with tf.keras 2.0

horovodrun -np 4 -H localhost:4 python3 entry.py
via LogicalClocks

Are you sure? 🤔

Problems
and
limitations

GPU hangs

Intel IOMMU + NCCL = 💔

Solution

Boot Linux with intel_iommu=off or intel_iommu=soft
bit.ly/2CkRkxH

P2P

Checking P2P

I tensorflow/.../gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/.../gpu_device.cc:1165]      0 1 2 3
I tensorflow/.../gpu_device.cc:1178] 0:   N Y N N
I tensorflow/.../gpu_device.cc:1178] 1:   Y N N N
I tensorflow/.../gpu_device.cc:1178] 2:   N N N Y
I tensorflow/.../gpu_device.cc:1178] 3:   N N Y N

Checking P2P

p2pBandwidthLatencyTest from github.com/NVIDIA/cuda-samples
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.25   0.99  12.84  13.39
     1   1.01   1.37  11.21  10.36
     2  12.39  12.33   1.28   1.07
     3  10.39  10.86   1.04   1.27

Checking P2P

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.15  10.28  11.18  11.16
     1  10.27 152.29  11.16  11.11
     2  11.15  10.85 355.44  10.28
     3  11.15  10.94  10.28 354.15

Network bandwidth

1 Gbit/s Ethernet is not enough
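
A back-of-the-envelope sketch of why: synchronous data parallelism allreduces the full gradient every step, and for a ResNet-50-sized model that alone takes on the order of a second over 1 Gbit/s. The numbers below are illustrative:

params = 25_000_000              # roughly ResNet-50
grad_bytes = params * 4          # fp32 gradients, ~100 MB
link = 1e9 / 8                   # 1 Gbit/s ~ 125 MB/s
n = 4                            # workers
# ring allreduce sends ~2 * (n - 1) / n of the buffer over each link
seconds = 2 * (n - 1) / n * grad_bytes / link
print(seconds)                   # ~1.2 s per step just for gradients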

Data pipeline performance

Training ImageNet in 20 minutes

Quick diagnostics: nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 43%   63C    P2   101W / 250W |  10909MiB / 11178MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   60C    P2   101W / 250W |  10913MiB / 11178MiB |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 41%   60C    P2    93W / 250W |  10909MiB / 11178MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 43%   62C    P2    92W / 250W |  10913MiB / 11178MiB |     69%      Default |
+-------------------------------+----------------------+----------------------+

Quick diagnostics: top

# N parallel map() calls keep the CPU cores busy (see %CPU in top below)
dataset = tf.data.Dataset \
    .from_whatever() \
    .map(augment, num_parallel_calls=N) \
    .batch(batch_size, drop_remainder=True)
PID USER PR NI   VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 21 root 20  0 0.122t 0.023t 945388 R  1333  9.3 63880:08 python3
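
1333% CPU means the parallel map() is keeping roughly 13 cores busy; if instead the GPUs idle while the CPUs do too, the usual fixes are parallel map, prefetch, and letting tf.data tune the parallelism. A sketch, assuming TFRecord files and a load_and_augment function:

AUTOTUNE = tf.data.experimental.AUTOTUNE
dataset = tf.data.TFRecordDataset(files) \
    .map(load_and_augment, num_parallel_calls=AUTOTUNE) \
    .batch(batch_size, drop_remainder=True) \
    .prefetch(AUTOTUNE)  # overlap input processing with GPU compute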

Advanced diagnostics: profile

tensorflow.org/tensorboard/tensorboard_profiling_keras
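
The profiler can be driven from the standard tf.keras TensorBoard callback; a minimal sketch, with an illustrative log directory (profile_batch picks which batch to trace):

tb = tf.keras.callbacks.TensorBoard(
    log_dir="./logs",    # illustrative path
    profile_batch=2)     # trace the 2nd batch of the 1st epoch
model.fit(dataset, callbacks=[tb])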

Bugs

Bugs I found myself

Kubernetes is for the bravest

Kubeflow

Summary


Thank you

bit.ly/2NHMhNo