Tensor Parallelism and Pipeline Parallelism
These notes cover the multiple parallelism strategies from the video.
1 Data Parallelism
This basic strategy splits the data across GPUs; each GPU keeps a full copy of the model and gradients are averaged between replicas.
In the LLM era, model parallelism is actually needed as well, since the model itself no longer fits on a single GPU.
Callout: NVLink connects GPUs within a node, and InfiniBand connects nodes.
SXM-version GPUs are GPUs with NVLink connections instead of PCIe.
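A minimal sketch of the data-parallel idea (not a real DistributedDataParallel setup): every "GPU" keeps a full model replica, sees a different slice of the batch, and gradients are averaged before the shared update. The replica count, batch size, and learning rate here are illustrative.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)                                # the "global" model
replicas = [copy.deepcopy(model) for _ in range(2)]    # one full replica per device

x = torch.randn(4, 8)                                  # global batch
y = torch.randn(4, 2)
shards_x = torch.chunk(x, 2)                           # split the data, not the model
shards_y = torch.chunk(y, 2)

# Each replica computes gradients on its own shard.
for rep, xs, ys in zip(replicas, shards_x, shards_y):
    loss = nn.functional.mse_loss(rep(xs), ys)
    loss.backward()

# All-reduce step: average gradients across replicas, then apply one shared update.
with torch.no_grad():
    for name, p in model.named_parameters():
        grads = [dict(r.named_parameters())[name].grad for r in replicas]
        p -= 0.1 * torch.stack(grads).mean(dim=0)
```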
2 Pipeline Parallelism
Also called inter-layer parallelism (inter- means between): different layers live on different GPUs.
It has pipeline bubbles, so Google's GPipe paper introduced micro-batches to mitigate the wasted time.
NVIDIA describes the 1F1B (one forward, one backward) schedule in this blog.
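A minimal GPipe-style sketch of micro-batching, assuming two pipeline stages; the stage boundary and micro-batch count are illustrative, and in a real pipeline each stage sits on its own GPU so stage 2 can work on micro-batch 0 while stage 1 already processes micro-batch 1, which is what shrinks the bubble.

```python
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())   # e.g. on GPU 0
stage2 = nn.Sequential(nn.Linear(32, 4))                # e.g. on GPU 1

batch = torch.randn(8, 16)
target = torch.randn(8, 4)
micro_batches = torch.chunk(batch, 4)    # 4 micro-batches instead of 1 big batch
micro_targets = torch.chunk(target, 4)

total_loss = 0.0
for mb, tgt in zip(micro_batches, micro_targets):
    act = stage1(mb)                     # forward on stage 1
    out = stage2(act)                    # forward on stage 2
    loss = nn.functional.mse_loss(out, tgt)
    loss.backward()                      # gradients accumulate across micro-batches
    total_loss += loss.item()

print(f"accumulated loss over micro-batches: {total_loss:.4f}")
```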
3 Tensor Parallelism
Also called intra-layer parallelism (intra- means within).
Matrix multiplications can be split across GPUs, which is what leads to the TP algorithm.
Examples can be found in the Megatron-LM paper.
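A minimal sketch of the column-parallel split used in Megatron-LM: the weight matrix of a linear layer is cut by columns across two "devices", each computes its partial output, and the pieces are concatenated. Device placement is omitted so this runs on CPU; the shapes are illustrative.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # activations, replicated on both devices
A = torch.randn(8, 6)        # full weight matrix

A1, A2 = torch.chunk(A, 2, dim=1)    # column shards, one per "device"

Y1 = X @ A1                          # partial output on device 0
Y2 = X @ A2                          # partial output on device 1
Y_tp = torch.cat([Y1, Y2], dim=1)    # gather along the column dimension

assert torch.allclose(Y_tp, X @ A)   # matches the unsplit computation
```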
4 3D Parallelism
All these methods can be used together, and that is the idea behind DeepSpeed's 3D parallelism.
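A minimal sketch of how the three dimensions compose: each global rank gets a coordinate on a (data, pipeline, tensor) grid. The grid sizes and rank ordering here are illustrative assumptions, not DeepSpeed's actual rank layout.

```python
# Map each global rank to (dp, pp, tp) coordinates on a 3D process grid.
# Illustrative sizes: 2-way data x 2-way pipeline x 2-way tensor = 8 GPUs.
DP, PP, TP = 2, 2, 2
WORLD_SIZE = DP * PP * TP

def grid_coords(rank: int) -> tuple[int, int, int]:
    """Tensor-parallel ranks vary fastest (stay closest together),
    since TP needs the highest-bandwidth links (e.g. NVLink)."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

for rank in range(WORLD_SIZE):
    dp, pp, tp = grid_coords(rank)
    print(f"rank {rank}: data-parallel group {dp}, pipeline stage {pp}, tensor shard {tp}")
```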