CGX

Distributed training framework with communication compression support

CGX is a collection of approaches for efficiently integrating gradient compression into distributed DNN training frameworks.

The first approach is a (non-public) fork of the Horovod framework. We integrate the gradient compression logic into the communication layer and add filtering of the layers that are communicated.

The second approach is QNCCL, in which we integrate the gradient compression logic directly into the NCCL kernels.

The third approach is pytorch-cgx, a plug-in extension for torch.distributed that substantially improves scalability on servers with low-bandwidth inter-GPU communication. Your training scripts stay the same: the only required changes are importing the built extension, passing cgx as the backend parameter to torch.distributed.init_process_group, setting a few environment variables, and adjusting the launching script, as sketched below.
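As an illustration, here is a minimal sketch of the change to a training script. It assumes the built extension is importable as torch_cgx (the module name is an assumption here) and that the usual process-group environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) are provided by the launcher.

```python
import os

import torch
import torch.distributed as dist

# Importing the built extension registers the "cgx" backend with torch.distributed.
# The module name `torch_cgx` is an assumption; use the name of your built extension.
import torch_cgx

# Initialize the process group with the cgx backend instead of nccl/gloo.
# Rank and world size come from the environment set by the launcher.
dist.init_process_group(
    backend="cgx",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

# The rest of the training script (model creation, DistributedDataParallel,
# optimizer, training loop) remains unchanged.
```

The script can then be launched with the standard PyTorch launcher (e.g. `torchrun --nproc_per_node=<num_gpus> train.py`), with any compression-related environment variables documented by the extension set in the launch environment.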

Applying CGX to distributed training on a single machine with 8x RTX 3090 GPUs allows us to achieve a 60% speedup over NVIDIA NCCL:

[Figure: Transformer-XL base]