L3 Mapping -- Common Layer: Parallelism Planning

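Overview

The L3 common layer maps the WorkloadIR from L1 onto the hardware described by L2 and produces a DistributedModel. This logic is shared by the Math path and the G5 path.

Once parallelism planning is complete, the Math path continues in 04b-l3-math.md for Tiling + Scheduling, and the G5 path moves on to instruction generation (see the G5-specific documentation).

Out of scope: performance evaluation (handled by L4) and report generation (handled by L5).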

Module List

Module	Responsibility
`common/parallelism/planner.py`	ParallelismPlanner
`common/parallelism/parallel_spec.py`	ParallelSpec, ParallelType
`common/parallelism/pattern_rules.py`	Parallel pattern rules (embedding, MLA, FFN, MoE)
`common/plan/distributed_model.py`	DistributedModel, DistributedOp

ParallelismPlanner

Inputs/Outputs

Input: DeploymentSpec + BoardSpec + WorkloadIR
Output: DistributedModel

DeploymentSpec

from dataclasses import dataclass


@dataclass
class DeploymentSpec:
    tp: int = 1           # Tensor Parallelism
    pp: int = 1           # Pipeline Parallelism
    ep: int = 1           # Expert Parallelism
    moe_tp: int = 1       # TP degree inside MoE layers
    dp: int = 1           # Data Parallelism
    seq_len: int = 2048
    q_seq_len: int = 2048
    kv_seq_len: int = 2048
    batch_size: int = 1
    enable_tp_sp: bool = False
    enable_ring_attention: bool = False
    enable_zigzag: bool = False    # Zigzag pipeline optimization
    embed_tp: int = 1
    lmhead_tp: int = 1
    comm_protocol: int = 1
    kv_cache_rate: float = 0.0
    is_prefill: bool = False

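Computation Flow

Constraint check: verify dp * tp == moe_tp * ep (for MoE layers)
PP grouping: partition the layers into stages according to pp
Op sharding: select a ParallelSpec for each Op according to the pattern rules
Communication insertion: insert communication ops at sharding boundaries
Output: DistributedModel (a directed graph of compute + communication Ops)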
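
The first two steps of the flow can be sketched as follows. This is a minimal illustration, not the planner's actual code: `check_moe_constraint` and `split_into_stages` are hypothetical helper names, and the contiguous, as-even-as-possible layer split is an assumption (the text above only says that layers are grouped into stages by `pp`).

```python
def check_moe_constraint(dp: int, tp: int, moe_tp: int, ep: int) -> None:
    """Check the MoE-layer constraint dp * tp == moe_tp * ep."""
    if dp * tp != moe_tp * ep:
        raise ValueError(
            f"dp * tp ({dp * tp}) must equal moe_tp * ep ({moe_tp * ep})"
        )


def split_into_stages(num_layers: int, pp: int) -> list:
    """Split layer indices into pp contiguous stages.

    Assumption: stages are as even as possible, with earlier stages
    taking one extra layer when num_layers is not divisible by pp.
    """
    if pp < 1 or pp > num_layers:
        raise ValueError("pp must be in [1, num_layers]")
    base, extra = divmod(num_layers, pp)
    stages = []
    start = 0
    for stage_id in range(pp):
        size = base + (1 if stage_id < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages
```

For example, `check_moe_constraint(dp=2, tp=4, moe_tp=1, ep=8)` passes because both sides equal 8, while `ep=4` with the same other values would raise.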

Pattern Rules

Layer type	TP strategy	Communication
Embedding	Shard vocab by embed_tp	AllGather (downstream ops need the full hidden state)
MLA Q/KV proj	Shard output_dim by TP	-
MLA QK matmul	Shard heads by TP	-
MLA output proj	Shard input_dim by TP	AllReduce (reduction)
FFN gate/up	Shard intermediate by TP	-
FFN down	Shard input_dim by TP	AllReduce (reduction)
MoE Router	Not sharded	-
MoE Expert FFN	Shard by moe_tp	All2All (dispatch/combine)
LMHead	Shard vocab by lmhead_tp	AllGather
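
One way to read the table is as a lookup from layer type to (parallel degree, sharded dimension, communication). The sketch below encodes it that way; the `Rule` class, the dictionary keys, and `local_shape` are illustrative names and are not taken from `pattern_rules.py`. The table does not name the dimension sharded for MoE Expert FFN, so that entry leaves it unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    degree_field: Optional[str]  # DeploymentSpec field that drives the split
    split_dim: Optional[str]     # dimension that is sharded
    comm: Optional[str]          # communication listed for the layer type


# Hypothetical encoding of the table above.
PATTERN_RULES = {
    "embedding": Rule("embed_tp", "vocab", "allgather"),
    "mla_q_kv_proj": Rule("tp", "output_dim", None),
    "mla_qk_matmul": Rule("tp", "heads", None),
    "mla_output_proj": Rule("tp", "input_dim", "allreduce"),
    "ffn_gate_up": Rule("tp", "intermediate", None),
    "ffn_down": Rule("tp", "input_dim", "allreduce"),
    "moe_router": Rule(None, None, None),
    "moe_expert_ffn": Rule("moe_tp", None, "all2all"),  # dimension not given
    "lmhead": Rule("lmhead_tp", "vocab", "allgather"),
}


def local_shape(layer_type: str, shape: dict, degrees: dict) -> dict:
    """Return the per-chip shape after applying the rule for layer_type."""
    rule = PATTERN_RULES[layer_type]
    out = dict(shape)
    if rule.degree_field is None or rule.split_dim is None:
        return out  # nothing to shard, or the dimension is unknown
    degree = degrees[rule.degree_field]
    if out[rule.split_dim] % degree != 0:
        raise ValueError(f"{rule.split_dim} is not divisible by {degree}")
    out[rule.split_dim] //= degree
    return out
```

With `tp=4`, an FFN down projection whose `input_dim` is 1024 would end up with a local `input_dim` of 256, and its rule signals that an AllReduce is needed afterwards.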

DistributedModel

Data Structure

@dataclass
class DistributedModel:
    ops: list[DistributedOp]                       # all ops (ordered)
    op_map: dict[str, DistributedOp]               # op_id -> op (fast lookup)
    tp: int                                        # parallel degrees
    pp: int
    ep: int
    num_chips: int
    stages: list[list[str]]                        # PP stage partition (stage -> op_ids)
    parallel_groups: dict[str, list[list[int]]]    # parallel groups (tp/pp/ep/dp -> lists of chip_ids)
    rank_map: dict[int, dict[str, int]]            # chip_id -> {tp_rank, pp_rank, ...}
    graph_nodes: list[str]                         # compute graph nodes
    graph_edges: list[tuple[str, str]]             # compute graph dependency edges
    chip_assignments: dict[str, list[int]]         # op_id -> assigned chip_ids
    op_parallel_specs: dict[str, ParallelSpec]     # op_id -> parallel spec
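
As an illustration of how `graph_nodes` and `graph_edges` might be consumed, the sketch below derives a dependency-respecting order with Kahn's algorithm. It assumes each edge is a `(producer, consumer)` pair, which the field comments do not state explicitly; `topo_order` is a hypothetical helper, not part of `distributed_model.py`.

```python
from collections import deque


def topo_order(graph_nodes, graph_edges):
    """Return the nodes in an order where every producer precedes its consumers.

    Assumption: each edge is (producer_op_id, consumer_op_id).
    """
    indegree = {node: 0 for node in graph_nodes}
    successors = {node: [] for node in graph_nodes}
    for src, dst in graph_edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from nodes with no unmet dependencies, in their listed order.
    ready = deque(node for node in graph_nodes if indegree[node] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(graph_nodes):
        raise ValueError("graph_edges contains a cycle")
    return order
```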

DistributedOp

@dataclass
class DistributedOp:
    op_id: str
    role: NodeRole          # COMPUTE | COMM
    op_type: str            # "matmul", "allreduce", ...
    local_shape: dict[str, int]  # shape after sharding

    # Parallelism info
    parallel_spec: ParallelSpec | None
    stage_id: int                          # PP stage ID
    chip_ids: list[int]                    # IDs of the participating chips

    # Communication-only fields (role=COMM)
    comm_type: CommType | None             # ALLREDUCE, ALLGATHER, ALL2ALL, P2P
    comm_bytes: int
    participants: list[int]                # participating chip IDs
    topology_path_key: str                 # "c2c", "b2b", "r2r", "p2p"
    reason: str | None                     # "tp_reduce", "moe_dispatch", ...
    scope: str                             # "inter_chip" | "intra_chip"
    cause: str                             # "layout_mismatch" | "tiling_reduce" | ...
    src: int | None                        # P2P source chip_id
    dst: int | None                        # P2P destination chip_id
    algo_hint: str | None                  # communication algorithm hint (e.g. "ring")
    trigger_edge_id: str | None            # dependency edge that triggered the communication

    # Dependencies
    deps: list[str]                        # op_ids this op depends on
    attrs: dict[str, str]                  # other attributes

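Communication Op Types

CommType	Purpose	Typical scenario
ALLREDUCE	Activation reduction	MLA/FFN output_proj within a TP group
ALLGATHER	Activation broadcast	After Embedding, TP-SP
REDUCE_SCATTER	Reduce + scatter	TP-SP MLA
ALL2ALL	All-to-all exchange	MoE dispatch/combine
P2P	Point-to-point	PP stage boundary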
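
The table can be mirrored by a small enum. The sketch below is an assumption about shape only: the member names come from the table, the string values are invented for illustration, and `REASON_TO_COMM` covers just the two `reason` strings that appear in the DistributedOp field comments.

```python
from enum import Enum


class CommType(Enum):
    # Member names follow the table; the string values are illustrative.
    ALLREDUCE = "allreduce"
    ALLGATHER = "allgather"
    REDUCE_SCATTER = "reduce_scatter"
    ALL2ALL = "all2all"
    P2P = "p2p"


# Only the reason strings named in the DistributedOp comments are mapped.
REASON_TO_COMM = {
    "tp_reduce": CommType.ALLREDUCE,
    "moe_dispatch": CommType.ALL2ALL,
}


def comm_type_for(reason: str) -> CommType:
    """Look up the CommType for a reason string, failing loudly if unknown."""
    try:
        return REASON_TO_COMM[reason]
    except KeyError:
        raise ValueError(f"no CommType known for reason {reason!r}") from None
```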

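Parallel Strategy Assignment Order

Inner to outer: TP -> EP -> PP -> DP

TP group: prefer chips on the same board (high-bandwidth c2c)
EP group: may span boards
PP group: may span racks (P2P communication)
DP group: may span pods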
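
An inner-to-outer ordering can be illustrated with a mixed-radix decomposition of the chip ID, where the innermost axis varies fastest so that its groups occupy consecutive chip IDs. This is a simplified sketch under strong assumptions: every axis is treated as independent, and consecutive chip IDs are assumed to share a board. The real planner may instead overlay EP on the DP x TP chips, as the constraint dp * tp == moe_tp * ep suggests. `build_rank_map` and `build_groups` are hypothetical names.

```python
def build_rank_map(degrees):
    """Map chip_id -> {"<axis>_rank": rank}.

    degrees: list of (axis_name, degree), ordered innermost first,
    e.g. [("tp", 2), ("ep", 1), ("pp", 2), ("dp", 1)].
    """
    num_chips = 1
    for _, degree in degrees:
        num_chips *= degree

    rank_map = {}
    for chip_id in range(num_chips):
        rest = chip_id
        ranks = {}
        for name, degree in degrees:
            rest, rank = divmod(rest, degree)  # innermost axis varies fastest
            ranks[f"{name}_rank"] = rank
        rank_map[chip_id] = ranks
    return rank_map


def build_groups(rank_map, axis):
    """Group chips that differ only in their rank along the given axis."""
    key = f"{axis}_rank"
    buckets = {}
    for chip_id, ranks in rank_map.items():
        others = tuple(v for k, v in sorted(ranks.items()) if k != key)
        buckets.setdefault(others, []).append(chip_id)
    return list(buckets.values())
```

With `tp=2` and `pp=2`, the TP groups come out as consecutive chip IDs, while each PP group strides across the TP groups.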
