🚦 Distributed Training
1 GPU is not enough anymore?
                                
Training huge deep learning models often requires multiple GPUs. Let's break down how DataParallel and DistributedDataParallel work, and peek at other parallelism strategies.
How DataParallel (DP) Works
- Single process: one Python program handles all GPUs at once.
 
Algorithm:
- Data is loaded on the CPU.
- The main process copies the model to each GPU.
- The batch is split into N sub-batches (one per GPU).
- Each GPU computes its forward and backward pass.
- Gradients are gathered on GPU0 (the main GPU), the weights are updated there, and the updated model is copied back to all GPUs.
 
🙃 Cons: GPU0 can become a bottleneck, and extra memory copies between GPUs slow things down.
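As a minimal sketch of the DP workflow (assuming a machine with two or more visible GPUs; the toy model and sizes are purely illustrative):

```python
import torch
import torch.nn as nn

# Toy model; nn.DataParallel replicates it to every visible GPU on each forward pass.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model = nn.DataParallel(model).cuda()    # parameters live on GPU0 (the "main" GPU)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512).cuda()          # the full batch starts on GPU0 ...
y = torch.randint(0, 10, (64,)).cuda()

optimizer.zero_grad()
out = model(x)                           # ... DP scatters it into per-GPU sub-batches
loss = loss_fn(out, y)
loss.backward()                          # gradients are gathered back on GPU0
optimizer.step()                         # weights are updated on GPU0 and re-broadcast on the next forward
```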
How DistributedDataParallel (DDP) Works
- Multiple processes: one per GPU.
- Each process loads its own copy of the model and receives its chunk of data via DistributedSampler.
Algorithm:
- Processes compute forward and backward passes independently.
- After .backward(), an all-reduce averages gradients across GPUs directly via NCCL (no CPU involved).
- Weights are updated locally in each process; they remain identical because the gradients are synchronized.
 
💪 Pros: No GPU0 bottleneck, less communication overhead, better scalability.
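A minimal DDP training loop might look like this (a sketch assuming one process per GPU launched with torchrun; the dataset, model, and hyperparameters are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # One process per GPU; torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # hooks NCCL all-reduce into backward
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)            # each rank gets its own shard of the data
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                          # gradients averaged across GPUs here
            optimizer.step()                         # every rank applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched, for example, with torchrun --nproc_per_node=4 train_ddp.py (the filename is illustrative).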
Other Parallelism Strategies
Besides DP and DDP, modern deep learning frameworks use:
- Model Parallelism: split a model's layers across different devices. Useful for huge models that don't fit on a single GPU.
- Tensor Parallelism: split individual layers' tensors across multiple GPUs. Often used in large transformer models like GPT or LLaMA.
- Pipeline Parallelism: split a model into sequential stages and process micro-batches through a pipeline. Helps improve throughput for very deep models.
 
In practice, DDP combined with tensor and/or pipeline parallelism is the standard recipe for training large-scale transformers efficiently.
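To make the model-parallelism idea concrete, here is a minimal two-GPU sketch (the split point, layer sizes, and device names are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Each half of the network lives on its own GPU; activations are moved between
# devices manually inside forward().
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))   # hand the activation to the second GPU

model = TwoGPUModel()
out = model(torch.randn(32, 512))            # output ends up on cuda:1
```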
Published on August 22, 2025
Author: Vitaly