High Performance Networking for Distributed DL Training in Production K8s

Nivedita Viswanath, Vatsan Kasturi at KubeCon + CloudNativeCon North America 2020

Distributed DL training requires high performance networks connecting tens, hundreds, or for certain natural language processing models, even thousands of GPUs. Running these workloads on Kubernetes clusters of GPU enhanced servers requires careful engineering to avoid bottlenecks at NIC and switching fabric that act as interconnect between nodes. In this presentation we will describe the design and architecture of a 800 GPU cluster interconnected over RoCE fabric to achieve line rate performance between communicating containers in a multi-node job. Some of the topics we will cover are scalable cookie-cutter POD design for DC, low latency one hop network design that enables NCCL rings to avoid output port congestion and K8s integration with a multi-homed network for optimal GPU utilization. We will share performance numbers for training workloads from our production clusters.