Pipeline Parallelism

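Pipeline parallelism (PP) splits a model by layer into multiple stages and places each stage on a different device. Narayanan et al. (SC 2021) systematized how PP combines with tensor parallelism (TP) and data parallelism (DP) in Megatron-LM v2. PP moves relatively little data, but it introduces pipeline bubbles, which are one of the main sources of efficiency loss in large-model training.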

Basic Principles

PP splits the model's $L$ Transformer layers evenly into $p$ stages, each assigned to its own device:

Stage 0          Stage 1          Stage 2          Stage 3
[Layers 0-7]  →  [Layers 8-15]  →  [Layers 16-23]  →  [Layers 24-31]
              P2P              P2P               P2P
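As a minimal sketch of the split pictured above, the helper below assigns contiguous layer ranges to stages. The name `partition_layers` is hypothetical, and handing leftover layers to the earliest stages when $L$ is not divisible by $p$ is an assumption made for illustration, not something the text prescribes.

```python
def partition_layers(num_layers: int, num_stages: int) -> list:
    """Split layer indices 0..num_layers-1 into contiguous, near-equal stages."""
    if num_stages <= 0 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for stage in range(num_stages):
        # Assumption: the first `extra` stages take one additional layer.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


for i, layers in enumerate(partition_layers(32, 4)):
    print(f"Stage {i}: layers {layers.start}-{layers.stop - 1}")
```

With $L = 32$ and $p = 4$ this reproduces the four 8-layer stages shown in the diagram.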

To raise device utilization, a global batch is split into $m$ micro-batches that are fed into the pipeline one after another.

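1F1B schedule (One Forward, One Backward): stage $i$ first runs $p - i - 1$ warmup forward passes; in the steady state each device then alternates one forward pass and one backward pass, and the remaining backward passes drain the pipeline at the end. The diagram uses $p = 4$ and $m = 4$, draws forward and backward passes as equal-width slots, and marks idle slots with --.

Time →
Stage 0: F0 F1 F2 F3 -- -- -- B0 -- B1 -- B2 -- B3
Stage 1: -- F0 F1 F2 -- -- B0 F3 B1 -- B2 -- B3 --
Stage 2: -- -- F0 F1 -- B0 F2 B1 F3 B2 -- B3 -- --
Stage 3: -- -- -- F0 B0 F1 B1 F2 B2 F3 B3 -- -- --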
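The per-stage order of operations under 1F1B can be sketched as follows. The name `one_f_one_b_order` is hypothetical, and the rule that stage $i$ runs $p - i - 1$ warmup forward passes before alternating is the usual non-interleaved convention, assumed here for illustration. The sketch lists only the order of work on one stage; it does not model timing or communication.

```python
def one_f_one_b_order(stage: int, num_stages: int, num_microbatches: int) -> list:
    """Return the op order (e.g. 'F0', 'B0') executed by one stage under 1F1B."""
    # Warmup: later stages need fewer forward passes before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops = [f"F{i}" for i in range(warmup)]
    # Steady state: one forward, then one backward, repeated.
    for i in range(steady):
        ops.append(f"F{warmup + i}")
        ops.append(f"B{i}")
    # Cooldown: finish the backward passes still outstanding.
    ops.extend(f"B{i}" for i in range(steady, num_microbatches))
    return ops


for stage in range(4):
    print(f"Stage {stage}:", " ".join(one_f_one_b_order(stage, 4, 4)))
```

The last stage alternates from its first micro-batch, while stage 0 runs the most forward passes up front before its first backward pass.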

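Communication Timing

PP communication happens at the boundaries between adjacent stages:

Forward pass: stage $i$ sends its output activations to stage $i+1$ via a P2P Send
Backward pass: stage $i+1$ sends the gradient with respect to those activations back to stage $i$ via a P2P Send

The communication pattern is a linear chain of P2P transfers: stage $i$ talks only to stages $i-1$ and $i+1$, and no collective communication primitives are involved.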
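The chain topology and the resulting message count can be sketched in a few lines. Both function names are hypothetical, and the count assumes exactly one activation send and one gradient send per micro-batch at each of the $p - 1$ stage boundaries, as described above.

```python
from typing import Optional, Tuple


def p2p_peers(stage: int, num_stages: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (previous, next) stage indices; None marks an end of the chain."""
    prev_stage = stage - 1 if stage > 0 else None
    next_stage = stage + 1 if stage < num_stages - 1 else None
    return prev_stage, next_stage


def p2p_sends_per_step(num_stages: int, num_microbatches: int) -> int:
    """Total P2P sends across the whole pipeline in one training step."""
    boundaries = num_stages - 1
    # Per boundary and micro-batch: one activation send and one gradient send.
    return 2 * boundaries * num_microbatches


for stage in range(4):
    print(stage, p2p_peers(stage, 4))
print(p2p_sends_per_step(4, 8))  # 2 * 3 * 8 = 48
```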

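Communication Volume Analysis

For micro-batch size $b$, sequence length $s$, and hidden size $h$, each P2P transfer carries one activation tensor (or its gradient), so the message size is: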

$M_{\text{PP}} = b \times s \times h \times \text{dtype\_size}$

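This matches the TP AllReduce message size (both are tensors of size $b \times s \times h$), but the communication semantics differ: PP uses point-to-point transfers, while TP uses the AllReduce collective.

Typical value ($b=1$, $s=4096$, $h=7168$, BF16): about 58.7 MB (56 MiB) per P2P transfer.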
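The typical value can be reproduced with a short calculation. The function name `pp_message_bytes` and the dtype table are illustrative assumptions; the arithmetic is the $M_{\text{PP}}$ formula above.

```python
# Bytes per element for a few common training dtypes.
DTYPE_BYTES = {"fp32": 4, "bf16": 2, "fp16": 2}


def pp_message_bytes(b: int, s: int, h: int, dtype: str = "bf16") -> int:
    """Size in bytes of one activation (or gradient) tensor sent between stages."""
    return b * s * h * DTYPE_BYTES[dtype]


size = pp_message_bytes(b=1, s=4096, h=7168, dtype="bf16")
print(f"{size} bytes = {size / 1e6:.1f} MB = {size / 2**20:.1f} MiB")
```

Because the size scales linearly in $b$, $s$, and $h$, doubling the micro-batch size doubles every P2P message.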

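Summary of PP communication characteristics:

Characteristic	Value
Communication primitive	P2P Send/Recv
Message size	10 ~ 100 MB
Communication pattern	Linear chain (stage $i$ ↔ stage $i \pm 1$)
Frequency	Once per micro-batch at each stage boundary (per direction)
Latency sensitivity	Medium (pipeline bubbles can partially hide communication latency)
Recommended topology	High-bandwidth links between adjacent stages suffice (full connectivity is not required)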

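Pipeline Bubble Analysis

The main performance cost of PP is not the communication itself but the pipeline bubble: the time devices sit idle while the pipeline fills (startup) and empties (drain).

Bubble ratio formula (source: Narayanan et al., SC 2021, arXiv:2104.04473 §3.1):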

$\text{Bubble ratio} = \frac{p - 1}{m + p - 1}$

where $p$ is the number of stages (the pipeline depth) and $m$ is the number of micro-batches.

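Interpretation:

Numerator $p-1$: each device is idle for a total of $p-1$ slots across startup (warmup) and drain, where one slot is one forward plus one backward pass
Denominator $m + p - 1$: the total number of time slots in the pipeline ($m$ slots of useful work plus $p-1$ bubble slots)
When $m \gg p$, the formula is approximately $(p-1)/m$, which is also the ratio of bubble time to ideal compute time (the form reported by Narayanan et al.); as an approximation of the fraction above it holds only when $m \gg p$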
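The formula follows from counting slots on a single device. This sketch assumes every micro-batch takes the same forward time $t_f$ and backward time $t_b$ on every stage and that P2P latency is ignored; the symbols $T_{\text{ideal}}$ and $T_{\text{bubble}}$ are introduced here only for illustration.

$T_{\text{ideal}} = m \, (t_f + t_b)$

$T_{\text{bubble}} = (p - 1) \, (t_f + t_b)$

$\text{Bubble ratio} = \frac{T_{\text{bubble}}}{T_{\text{ideal}} + T_{\text{bubble}}} = \frac{p - 1}{m + p - 1}$

For example, $p = 4$ and $m = 4$ give $3/7 \approx 42.9\%$, while dividing $T_{\text{bubble}}$ by $T_{\text{ideal}}$ alone gives $(p-1)/m = 75\%$.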

Typical values (slot duration $\alpha = t_f + t_b$, P2P latency ignored):

$p$	$m$	Exact $(p-1)/(m+p-1)$	Approximation $(p-1)/m$	Relative error
4	8	27.3%	37.5%	+37.5%
8	8	46.7%	87.5%	+87.5%
8	16	30.4%	43.8%	+43.8%
4	32	8.6%	9.4%	+9.4%
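The table entries can be recomputed with a few lines of Python. The function names are hypothetical; the formulas are the exact and approximate bubble ratios given above.

```python
def bubble_ratio(p: int, m: int) -> float:
    """Exact fraction of total step time spent in the bubble."""
    return (p - 1) / (m + p - 1)


def bubble_ratio_approx(p: int, m: int) -> float:
    """Common simplification, reasonable only when m is much larger than p."""
    return (p - 1) / m


def relative_error(p: int, m: int) -> float:
    """How much the approximation overstates the exact ratio."""
    return bubble_ratio_approx(p, m) / bubble_ratio(p, m) - 1


for p, m in [(4, 8), (8, 8), (8, 16), (4, 32)]:
    print(f"p={p} m={m} exact={bubble_ratio(p, m):.3f} "
          f"approx={bubble_ratio_approx(p, m):.3f} "
          f"rel_err={relative_error(p, m):+.3f}")
```

Algebraically the relative error equals $(p-1)/m$ itself, so the approximation is least trustworthy exactly where the bubble is largest.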

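Ways to reduce the bubble:

Increase $m$ (more micro-batches): $(p-1)/(m+p-1) \to 0$, but this can increase memory pressure (1F1B limits it by capping the in-flight micro-batches per device at $p$)
Decrease $p$ (fewer stages): runs against the reason for using PP in the first place
Interleaved 1F1B (interleaved pipeline): each device holds $v$ model chunks instead of one contiguous stage, and the bubble ratio drops to $\frac{p-1}{vm + p - 1}$ (approximately $\frac{p-1}{vm}$ when $vm \gg p$)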
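To get a feel for how the levers above interact, the sketch below evaluates the interleaved bubble ratio for a few settings. The function name is hypothetical, and the formula is the one quoted above with $v = 1$ standing for the non-interleaved schedule.

```python
def interleaved_bubble_ratio(p: int, m: int, v: int = 1) -> float:
    """Bubble ratio with v model chunks per device (v=1 means no interleaving)."""
    if min(p, m, v) < 1:
        raise ValueError("p, m and v must all be at least 1")
    return (p - 1) / (v * m + p - 1)


p, m = 8, 16
for v in (1, 2, 4):
    print(f"p={p} m={m} v={v}: {interleaved_bubble_ratio(p, m, v):.3f}")
```

In this formula, doubling $v$ shrinks the bubble exactly as much as doubling $m$ would, without changing the number of micro-batches.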

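Effect of communication latency on the bubble: the latency $\alpha_{\text{P2P}}$ of each P2P transfer directly widens the bubble, so adjacent stages should be connected by low-latency links.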

References

Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM", SC 2021. https://arxiv.org/abs/2104.04473
Narayanan et al., "Memory-Efficient Pipeline-Parallel DNN Training", ICML 2021. https://arxiv.org/abs/2006.09503
