🚦 Distributed Training

Or: why one GPU just isn’t enough anymore 😭


Training huge deep learning models often requires multiple GPUs. Let’s break down how DataParallel and DistributedDataParallel work, and peek at other parallelism strategies.

How DataParallel (DP) Works

  • Single process: one Python program handles all GPUs at once.

Algorithm:

  1. Data is loaded on CPU.
  2. The process copies the model to each GPU.
  3. Batch is split into N sub-batches (one per GPU).
  4. Each GPU computes forward and backward passes.
  5. Gradients are gathered on GPU0 (the main device), weights are updated there, and the updated model is copied back to the other GPUs before the next forward pass.

🙃 Cons: GPU0 can become a bottleneck, and extra memory copies between GPUs slow things down.
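
Here is a minimal DP sketch in PyTorch (the model and batch sizes are illustrative, and at least two GPUs are assumed):

```python
import torch
import torch.nn as nn

# Toy model; any nn.Module works the same way with DP.
model = nn.Linear(512, 10)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model onto every visible GPU
    # and splits each input batch along dim 0.
    model = nn.DataParallel(model)

model = model.to("cuda")

inputs = torch.randn(64, 512).to("cuda")  # full batch lands on GPU0 first
outputs = model(inputs)                   # sub-batches scatter to all GPUs
loss = outputs.sum()                      # dummy loss for illustration
loss.backward()                           # gradients are reduced onto GPU0
```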

How DistributedDataParallel (DDP) Works

  • Multiple processes: one per GPU.
  • Each process loads its own copy of the model and receives its chunk of data via DistributedSampler.

Algorithm:

  1. Processes compute forward and backward passes independently.
  2. After .backward(), an all-reduce averages gradients across GPUs directly via NCCL (no CPU involved).
  3. Weights are updated locally in each process; they remain identical because gradients are synchronized.

💪 Pros: No GPU0 bottleneck, less communication overhead, better scalability.
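
A minimal DDP training script might look like this (the dataset, model, and hyperparameters are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model; real code would load its own data here.
    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # NCCL all-reduce averages gradients here
            optimizer.step()  # identical update on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with one process per GPU, e.g. `torchrun --nproc_per_node=4 train_ddp.py` (the file name is hypothetical).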

Other Parallelism Strategies

Besides DP and DDP, modern deep learning frameworks use:

  • Model Parallelism: split a model’s layers across different devices. Useful for huge models that don’t fit on a single GPU (a minimal sketch follows this list).
  • Tensor Parallelism: split individual layers’ tensors across multiple GPUs. Often used in large transformer models like GPT or LLaMA.
  • Pipeline Parallelism: split a model into sequential stages and process micro-batches through a pipeline. Helps improve throughput for very deep models.
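
To make model parallelism concrete, here is a naive two-GPU split (layer sizes are made up, and two GPUs are assumed):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(2048, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop between devices

model = TwoGPUModel()
out = model(torch.randn(32, 512))  # output lives on cuda:1
```

The price of this simplicity is that each GPU idles while the other works, which is exactly the problem pipeline parallelism tackles.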

In practice, DDP combined with tensor and/or pipeline parallelism is the standard recipe for training large-scale transformers efficiently; the micro-batch idea behind pipelining is sketched below.
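
A hand-rolled sketch of that micro-batch idea (two stages with made-up layer sizes; real systems use schedulers like GPipe or PipeDream):

```python
import torch
import torch.nn as nn

# Two pipeline stages, one per GPU (same split as the sketch above).
stage1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
stage2 = nn.Linear(2048, 10).to("cuda:1")

def pipelined_forward(batch, n_micro=4):
    """Feed micro-batches through the two stages. Because CUDA launches
    are asynchronous, cuda:0 can start chunk i+1 while cuda:1 is still
    processing chunk i -- the basic overlap behind pipeline parallelism."""
    outputs = []
    for chunk in batch.chunk(n_micro):               # split into micro-batches
        hidden = stage1(chunk.to("cuda:0"))          # stage 1
        outputs.append(stage2(hidden.to("cuda:1")))  # stage 2
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 512))
```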

Published on August 22, 2025 · Author: Vitaly