Vadim Markovtsev, source{d}.
tf.distribute.Strategy
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")
model.fit(dataset)
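For completeness, a minimal runnable sketch of the same MirroredStrategy example; the import and the toy dataset are additions here, not part of the original snippet:

import tensorflow as tf

# Synchronous data parallelism across all local GPUs (falls back to CPU).
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Variables created inside the scope are mirrored onto every replica.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")

# Toy dataset; each global batch is split between the replicas.
dataset = tf.data.Dataset.from_tensor_slices(
    ([[1.], [2.], [3.], [4.]], [[2.], [4.], [6.], [8.]])).batch(4)
model.fit(dataset, epochs=10)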
Use tf.keras, not the standalone import keras.
MirroredStrategy (local)
CentralStorageStrategy (local)
TPUStrategy (local)
MultiWorkerMirroredStrategy (network)
tf.data.Dataset
Sharding of the tf.data.Dataset between workers
training.py:
import tensorflow as tf
strategy = \
tf.distribute.experimental.MultiWorkerMirroredStrategy()
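The rest of training.py can follow the same Keras pattern as above; this sketch reuses the toy model and dataset, which are illustrative assumptions rather than the talk's code:

with strategy.scope():
    # Model and optimizer must be created inside the multi-worker scope.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss="mse", optimizer="sgd")

# Every worker builds the full dataset; it is sharded between workers (see above).
dataset = tf.data.Dataset.from_tensor_slices(
    ([[1.], [2.], [3.], [4.]], [[2.], [4.], [6.], [8.]])).batch(4)
model.fit(dataset, epochs=10)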
{
    "cluster": {
        "worker": ["host1:port", "host2:port"]
    },
    "task": {
        "type": "worker", "index": 0
    }
}
# on host1 (task index 0):
export TF_CONFIG=...
python3 training.py
# on host2 (task index 1):
export TF_CONFIG=...
python3 training.py
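Equivalently, TF_CONFIG can be assembled in Python before the strategy object is created; the host names, the port placeholder, and the worker index are taken from the JSON above:

import json
import os

# Must be set before MultiWorkerMirroredStrategy() is constructed.
# host2 runs the same script with "index": 1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:port", "host2:port"]
    },
    "task": {"type": "worker", "index": 0}
})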
2019-10-18 19:22:33.199624: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> localhost:5555, 1 -> 10.2.2.66:5555}
2019-10-18 19:22:33.206039: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:5555
tf.keras 2.0
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each Horovod process to one local GPU and avoid grabbing all GPU memory.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
tf.config.experimental.set_visible_devices(
    [gpus[hvd.local_rank()]], "GPU")
tf.keras 2.0
# Scale the learning rate with the number of Horovod processes.
learning_rate *= hvd.size()
opt = hvd.DistributedOptimizer(opt)
model.compile(..., experimental_run_tf_function=False)
callbacks.append(
    hvd.callbacks.BroadcastGlobalVariablesCallback(0))
if hvd.rank() == 0:
    callbacks.append("""checkpoint""")
tf.keras 2.0
horovodrun -np 4 -H localhost:4 python3 entry.py
Kernel boot parameters intel_iommu=off or intel_iommu=soft (the IOMMU can interfere with GPU peer-to-peer transfers).
I tensorflow/.../gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
I tensorflow/.../gpu_device.cc:1165]      0 1 2 3
I tensorflow/.../gpu_device.cc:1178] 0:   N Y N N
I tensorflow/.../gpu_device.cc:1178] 1:   Y N N N
I tensorflow/.../gpu_device.cc:1178] 2:   N N N Y
I tensorflow/.../gpu_device.cc:1178] 3:   N N Y N
p2pBandwidthLatencyTest
github.com/NVIDIA/cuda-samples
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.25   0.99  12.84  13.39
     1   1.01   1.37  11.21  10.36
     2  12.39  12.33   1.28   1.07
     3  10.39  10.86   1.04   1.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D      0       1       2       3
     0  354.15   10.28   11.18   11.16
     1   10.27  152.29   11.16   11.11
     2   11.15   10.85  355.44   10.28
     3   11.15   10.94   10.28  354.15
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 43%   63C    P2   101W / 250W | 10909MiB / 11178MiB  |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 42%   60C    P2   101W / 250W | 10913MiB / 11178MiB  |     22%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 41%   60C    P2    93W / 250W | 10909MiB / 11178MiB  |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 43%   62C    P2    92W / 250W | 10913MiB / 11178MiB  |     69%      Default |
+-------------------------------+----------------------+----------------------+
dataset = tf.data.Dataset \
    .from_whatever() \
    .map(augment, num_parallel_calls=N) \
    .batch(batch_size, drop_remainder=True)
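A concrete, runnable variant of that pipeline; the stand-in tensors, the flip-only augment function, and the final prefetch are illustrative assumptions:

import tensorflow as tf

def augment(image, label):
    # Placeholder augmentation: a random horizontal flip.
    return tf.image.random_flip_left_right(image), label

images = tf.zeros([128, 32, 32, 3])       # stand-in image tensor
labels = tf.zeros([128], dtype=tf.int32)  # stand-in labels
batch_size = 32

dataset = tf.data.Dataset \
    .from_tensor_slices((images, labels)) \
    .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
    .batch(batch_size, drop_remainder=True) \
    .prefetch(tf.data.experimental.AUTOTUNE)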
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21 root 20 0 0.122t 0.023t 945388 R 1333 9.3 63880:08 python3