As the scale of AI training increases, a single data center (DC) is not sufficient to deliver the required computational power. Most recent approaches to…
As the scale of AI training increases, a single data center (DC) is not sufficient to deliver the required computational power. Most recent approaches to address this challenge rely on multiple data centers being co-located or geographically distributed. In a recently open-sourced feature, the NVIDIA Collective Communication Library (NCCL) is now able to communicate across multiple data centers…