🚦 Distributed Training

Or: why one GPU just isn’t enough anymore 😭


Training huge deep learning models often requires multiple GPUs. Let’s break down how DataParallel and DistributedDataParallel work, and peek at other parallelism strategies.

How DataParallel (DP) Works

  • Single process: one Python program handles all GPUs at once.

Algorithm:

  1. Data is loaded on CPU.
  2. The process copies the model to each GPU.
  3. Batch is split into N sub-batches (one per GPU).
  4. Each GPU computes forward and backward passes.
  5. Gradients are gathered on GPU0 (the main device), weights are updated there, and the updated model is copied back to the other GPUs before the next forward pass.

🙃 Cons: GPU0 can become a bottleneck, and extra memory copies between GPUs slow things down.
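
Here is a minimal DP sketch in PyTorch (the model and batch sizes are illustrative, and at least two GPUs are assumed):

```python
import torch
import torch.nn as nn

# Toy model; any nn.Module works the same way with DP.
model = nn.Linear(512, 10)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the model onto every visible GPU
    # and splits each input batch along dim 0.
    model = nn.DataParallel(model)

model = model.to("cuda")

inputs = torch.randn(64, 512).to("cuda")  # full batch lands on GPU0 first
outputs = model(inputs)                   # sub-batches scatter to all GPUs
loss = outputs.sum()                      # dummy loss for illustration
loss.backward()                           # gradients are reduced onto GPU0
```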

How DistributedDataParallel (DDP) Works

  • Multiple processes: one per GPU.
  • Each process loads its own copy of the model and receives its chunk of data via DistributedSampler.

Algorithm:

  1. Processes compute forward and backward passes independently.
  2. After .backward(), an all-reduce averages gradients across GPUs directly via NCCL (no CPU involved).
  3. Weights are updated locally in each process; they remain identical because gradients are synchronized.

💪 Pros: No GPU0 bottleneck, less communication overhead, better scalability.
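
A minimal DDP training script might look like this (the dataset, model, and hyperparameters are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model; real code would load its own data here.
    dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # gives each rank its own shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(512, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()   # NCCL all-reduce averages gradients here
            optimizer.step()  # identical update on every rank

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with one process per GPU, e.g. `torchrun --nproc_per_node=4 train_ddp.py` (the file name is hypothetical).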

Other Parallelism Strategies

Besides DP and DDP, modern deep learning frameworks use:

  • Model Parallelism: split a model’s layers across different devices. Useful for huge models that don’t fit on a single GPU (a minimal sketch follows this list).
  • Tensor Parallelism: split individual layers’ tensors across multiple GPUs. Often used in large transformer models like GPT or LLaMA.
  • Pipeline Parallelism: split a model into sequential stages and process micro-batches through a pipeline. Helps improve throughput for very deep models.
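
To make model parallelism concrete, here is a naive two-GPU split (layer sizes are made up, and two GPUs are assumed):

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(2048, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop between devices

model = TwoGPUModel()
out = model(torch.randn(32, 512))  # output lives on cuda:1
```

The price of this simplicity is that each GPU idles while the other works, which is exactly the problem pipeline parallelism tackles.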

In practice, DDP combined with tensor and/or pipeline parallelism is the standard recipe for training large-scale transformers efficiently; the micro-batch idea behind pipelining is sketched below.
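
A hand-rolled sketch of that micro-batch idea (two stages with made-up layer sizes; real systems use schedulers like GPipe or PipeDream):

```python
import torch
import torch.nn as nn

# Two pipeline stages, one per GPU (same split as the sketch above).
stage1 = nn.Sequential(nn.Linear(512, 2048), nn.ReLU()).to("cuda:0")
stage2 = nn.Linear(2048, 10).to("cuda:1")

def pipelined_forward(batch, n_micro=4):
    """Feed micro-batches through the two stages. Because CUDA launches
    are asynchronous, cuda:0 can start chunk i+1 while cuda:1 is still
    processing chunk i -- the basic overlap behind pipeline parallelism."""
    outputs = []
    for chunk in batch.chunk(n_micro):               # split into micro-batches
        hidden = stage1(chunk.to("cuda:0"))          # stage 1
        outputs.append(stage2(hidden.to("cuda:1")))  # stage 2
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(64, 512))
```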

Published on August 22, 2025 · Author: Vitaly