🚦 Distributed Training
Why one GPU is no longer enough

Training huge deep learning models often requires multiple GPUs. Let’s break down how DataParallel and DistributedDataParallel work, and peek at other parallelism strategies.
How DataParallel (DP) Works
- Single process: one Python program handles all GPUs at once.
Algorithm:
- Data is loaded on CPU.
- The process copies the model to each GPU.
- Batch is split into N sub-batches (one per GPU).
- Each GPU computes forward and backward passes.
- Gradients are gathered on GPU0 (the main GPU), the weights are updated there, and the updated model is copied back to all GPUs.
🙃 Cons: GPU0 can become a bottleneck, and extra memory copies between GPUs slow things down.
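Here is a minimal sketch of what DP looks like in code. The toy model, batch size, and optimizer are placeholders chosen for illustration, not taken from any particular codebase:

```python
# Minimal DataParallel sketch: one process, the model replicated on every visible GPU.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Wrap the model: each input batch is split along dim 0 into per-GPU sub-batches.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda")

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 512).to("cuda")        # batch is split across GPUs automatically
y = torch.randint(0, 10, (64,)).to("cuda")

optimizer.zero_grad()
loss = criterion(model(x), y)              # forward runs on every replica
loss.backward()                            # gradients end up on GPU0
optimizer.step()                           # weights updated on GPU0, replicas refreshed next step
```

Notice that turning DP on is a one-line change, which is exactly why it is tempting despite the GPU0 bottleneck.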
How DistributedDataParallel (DDP) Works
- Multiple processes: one per GPU.
- Each process loads its own copy of the model and receives its chunk of data via `DistributedSampler`.
Algorithm:
- Processes compute forward and backward passes independently.
- After `.backward()`, an all-reduce averages gradients across GPUs directly via NCCL (no CPU involved).
- Weights are updated locally in each process; they remain identical because gradients are synchronized.
💪 Pros: No GPU0 bottleneck, less communication overhead, better scalability.
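A rough DDP sketch follows, meant to be launched with something like `torchrun --nproc_per_node=N train.py`. The toy model, dataset, and hyperparameters are again illustrative placeholders:

```python
# One process per GPU; torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for us.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
    # DistributedSampler gives each process its own, non-overlapping shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)            # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                 # NCCL all-reduce averages gradients here
            optimizer.step()                # each process updates its own (identical) copy

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The key difference from DP: there is no single "main" GPU collecting gradients; the all-reduce happens peer-to-peer during `backward()`.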
Other Parallelism Strategies
Besides DP and DDP, modern deep learning frameworks use:
- Model Parallelism: split a model’s layers across different devices. Useful for huge models that don’t fit on a single GPU (a toy sketch follows this list).
- Tensor Parallelism: split individual layers’ tensors across multiple GPUs. Often used in large transformer models like GPT or LLaMA.
- Pipeline Parallelism: split a model into sequential stages and process micro-batches through a pipeline. Helps improve throughput for very deep models.
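To make the model-parallelism idea concrete, here is a hand-rolled toy split across two GPUs. The layer sizes and device assignment are made up for illustration; real tensor/pipeline parallelism is normally handled by specialized libraries rather than written by hand:

```python
# Naive model parallelism: two halves of a network live on different GPUs,
# and activations are handed off between devices inside forward(). Needs 2 GPUs.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))   # move activations to the second GPU

model = TwoGPUModel()
out = model(torch.randn(8, 1024))           # the output tensor lives on cuda:1
```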
In practice, DDP combined with tensor and/or pipeline parallelism is the standard recipe for training large-scale transformers efficiently.
Published on August 22, 2025
Author: Vitaly