L1 Workload -- Workload Modeling Layer

Overview

L1 abstracts a model definition into a hardware-agnostic WorkloadIR, covering:

  • Model definition: builds the layer structure from a ModelConfig (DeepSeekV3Model, LlamaModel)
  • Layers and operators: a Layer holds a list of Ops; each Op carries FLOPs / bytes / shape information
  • IR output: WorkloadIR exposes a unified analysis interface (OpsBreakdown, MemoryFootprint)

Out of scope: hardware parameters, parallelism partitioning, and performance evaluation.
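
A minimal end-to-end sketch of this flow; the import paths below are assumptions, and it is assumed (as the module inventory suggests) that Model implementations satisfy the WorkloadIR protocol directly:

from workload.ir import WorkloadIR                         # illustrative import path
from workload.models.llm.deepseek import DeepSeekV3Model   # illustrative import path

def analyze(mc: "ModelConfig") -> None:
    # Build the hardware-agnostic layer structure from the structured ModelConfig.
    model = DeepSeekV3Model.from_model_config(mc)
    ir: WorkloadIR = model                    # satisfies the protocol structurally
    print(ir.get_ops_breakdown())             # aggregated compute per engine
    print(ir.get_memory_footprint())          # aggregated weight / activation bytes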

Module Inventory

Module                    Responsibility
ir.py                     WorkloadIR protocol + Model implementation
layer.py                  Layer / Module dataclasses
op.py                     Op dataclass
tensor.py                 TensorDesc (shape, dtype, bytes)
specs.py                  ComputeSpec, MemorySpec, CommSpec
metadata.py               ModelMetadata
graph.py                  WorkloadGraph (nodes + edges)
breakdown.py              OpsBreakdown, MemoryFootprint
models/llm/deepseek.py    DeepSeek V3 / R1 / V3.2 models
models/llm/llama.py       LLaMA models
layers/                   Per-layer implementations
operators/                Per-operator implementations

Core Data Structures

WorkloadIR Protocol

class WorkloadIR(Protocol):
    def get_layers(self) -> list[Layer]: ...
    def get_ops_breakdown(self) -> OpsBreakdown: ...
    def get_memory_footprint(self) -> MemoryFootprint: ...
    def get_communication_pattern(self) -> CommPattern: ...

Layer

@dataclass
class Layer:
    name: str                    # "layers.0.mla"
    op_type: str                 # "mla", "ffn", "moe", "embedding"
    inputs: list[TensorDesc]     # input tensor descriptors
    outputs: list[TensorDesc]    # output tensor descriptors
    ops: list[Op]                # list of compute operators
    comm: CommSpec | None        # distributed-communication pattern hint
    params: dict[str, Any]       # layer parameters (hidden_size, num_heads, ...)

Op

@dataclass
class Op:
    name: str                  # "layers.0.mla.q_proj"
    op_type: str               # "matmul", "softmax", "rmsnorm"
    compute: ComputeSpec       # cube_ops, vector_ops, scalar_ops, hau_ops
    memory: MemorySpec         # weight_bytes, activation_bytes, read/write
    inputs: list[TensorDesc]
    outputs: list[TensorDesc]
    attrs: dict[str, Any]      # extra attributes (dtype_bytes, local_weight_bytes)

ComputeSpec / MemorySpec

@dataclass
class ComputeSpec:
    cube_ops: int      # GEMM work (FLOPs)
    vector_ops: int    # vector-engine work
    scalar_ops: int    # scalar-engine work
    hau_ops: int       # HAU work (TopK / sort)

@dataclass
class MemorySpec:
    weight_bytes: int        # weight size
    activation_bytes: int    # activation size
    read_bytes: int          # total bytes read
    write_bytes: int         # total bytes written
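
Both specs are plain dataclasses, so layer- or model-level totals reduce to summing fields over an Op list. A small helper sketch (these helper functions are illustrative, not part of the documented API):

def total_compute(ops: list[Op]) -> ComputeSpec:
    # Sum each engine's workload across all operators.
    return ComputeSpec(
        cube_ops=sum(op.compute.cube_ops for op in ops),
        vector_ops=sum(op.compute.vector_ops for op in ops),
        scalar_ops=sum(op.compute.scalar_ops for op in ops),
        hau_ops=sum(op.compute.hau_ops for op in ops),
    )

def total_memory(ops: list[Op]) -> MemorySpec:
    # Sum weight / activation / traffic bytes across all operators.
    return MemorySpec(
        weight_bytes=sum(op.memory.weight_bytes for op in ops),
        activation_bytes=sum(op.memory.activation_bytes for op in ops),
        read_bytes=sum(op.memory.read_bytes for op in ops),
        write_bytes=sum(op.memory.write_bytes for op in ops),
    )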

DeepSeek V3 Model

Model Structure

Embedding
|
Dense Layers (x num_dense_layers):
+-- MLA (Multi-head Latent Attention)
+-- FFN (Dense Feed-Forward)
|
MoE Layers (x num_moe_layers):
+-- MLA (Multi-head Latent Attention)
+-- MoE (Mixture of Experts)
|
LMHead
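
The structure above maps directly onto a flat layer list. A construction sketch using the layer classes from the inventory below; the constructor signatures and the "embedding" / "lm_head" layer names are assumptions for illustration:

def build_layers(config: dict) -> list[Layer]:
    layers = [EmbeddingLayer("embedding", config)]
    n_dense = config["num_dense_layers"]
    n_moe = config["num_moe_layers"]
    for i in range(n_dense):                          # dense decoder blocks: MLA + FFN
        layers.append(MLALayer(f"layers.{i}.mla", config))
        layers.append(FFNLayer(f"layers.{i}.ffn", config))
    for i in range(n_dense, n_dense + n_moe):         # MoE decoder blocks: MLA + MoE
        layers.append(MLALayer(f"layers.{i}.mla", config))
        layers.append(MoELayer(f"layers.{i}.moe", config))
    layers.append(LMHeadLayer("lm_head", config))
    return layers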

from_model_config() classmethod

Converts the structured ModelConfig into the flat dict expected by the layer classes:

@classmethod
def from_model_config(cls, mc: ModelConfig) -> DeepSeekV3Model:
    config = {
        "hidden_size": mc.hidden_size,
        "num_heads": mc.num_attention_heads,             # key rename
        "q_lora_rank": mc.mla.q_lora_rank,               # pulled from nested sub-config
        "n_routed_experts": mc.moe.num_routed_experts,   # key rename
        "mla_mode": mc.mla.mla_mode,                     # "standard" | "absorb" | "auto"
        # ... full mapping
    }
    return cls(config)

MLA (Multi-head Latent Attention)

The attention mechanism specific to DeepSeek V3:

  • LoRA compression: Q uses a low-rank projection with q_lora_rank=1536, KV uses kv_lora_rank=512
  • RoPE split: qk_rope_head_dim=64 (rotary) + qk_nope_head_dim=128 (non-rotary)
  • Efficient KV cache: compression shrinks the KV cache by 5-8x
  • Three modes (controlled by mla_mode; see the selection sketch after this list):
    • standard: standard MLA with full QKV projections
    • absorb: absorbed variant that folds in the KV LoRA projection (more efficient in the decode phase)
    • auto: selected automatically based on is_prefill
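
A sketch of how auto mode could resolve to a concrete variant per phase; the helper below is illustrative (the page only states that auto chooses based on is_prefill and that the absorbed variant is more efficient for decode):

def resolve_mla_mode(mla_mode: str, is_prefill: bool) -> str:
    # Explicit settings pass through; "auto" picks the absorbed variant
    # for decode and the standard variant for prefill.
    if mla_mode != "auto":
        return mla_mode
    return "standard" if is_prefill else "absorb"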

MoE (Mixture of Experts)

  • Routing: num_routed_experts=256, num_activated_experts=8 (Top-8)
  • Shared experts: num_shared_experts=1 (always active)
  • Communication: EP parallelism requires All2All dispatch/combine
  • FFN structure: each expert has its own gate_proj, up_proj, and down_proj (see the per-token FLOPs sketch after this list)
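
A rough per-token cube-engine FLOPs sketch under these parameters, assuming each expert is a standard gate/up/down FFN with an intermediate dimension moe_intermediate_size (a name assumed here, not taken from this page) and ignoring router cost:

def moe_cube_flops_per_token(hidden_size: int,
                             moe_intermediate_size: int,
                             num_activated_experts: int = 8,
                             num_shared_experts: int = 1) -> int:
    # Each expert runs three matmuls: gate_proj and up_proj (hidden -> inter)
    # plus down_proj (inter -> hidden), each costing 2 * M * N * K FLOPs with M = 1 token.
    per_expert = 3 * 2 * hidden_size * moe_intermediate_size
    # Top-k routed experts plus the always-active shared expert(s).
    return (num_activated_experts + num_shared_experts) * per_expert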

Layer Implementation Inventory

Layer               File                    Operators
EmbeddingLayer      layers/embedding.py     matmul (token -> hidden)
MLALayer            layers/mla.py           Q/KV LoRA proj, QK matmul, softmax, V matmul, output proj
MLAAbsorbLayer      layers/mla_absorb.py    absorbed variant for decode
FFNLayer            layers/ffn.py           gate_proj, up_proj, SiLU, down_proj
MoELayer            layers/moe.py           router, dispatch, expert FFN x activated, combine
LMHeadLayer         layers/lmhead.py        matmul (hidden -> vocab)
DSALayer            layers/dsa.py           Dynamic Sparse Attention

Op Compute Estimation

MatMul (GEMM)

FLOPs = 2 * M * N * K
Weight bytes = K * N * dtype_bytes
Activation bytes = M * K * dtype_bytes
Output bytes = M * N * dtype_bytes
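
A sketch of how these formulas populate an Op for a single GEMM, assuming a small factory helper (the helper name and the read/write split are assumptions; dtype_bytes=2 corresponds to FP16/BF16):

def make_matmul_op(name: str, M: int, N: int, K: int, dtype_bytes: int = 2) -> Op:
    # Cost model: 2*M*N*K FLOPs on the cube engine; weights K*N, activations M*K, output M*N.
    return Op(
        name=name,
        op_type="matmul",
        compute=ComputeSpec(cube_ops=2 * M * N * K, vector_ops=0, scalar_ops=0, hau_ops=0),
        memory=MemorySpec(
            weight_bytes=K * N * dtype_bytes,
            activation_bytes=M * K * dtype_bytes,
            read_bytes=(K * N + M * K) * dtype_bytes,   # weights + input activations
            write_bytes=M * N * dtype_bytes,            # output
        ),
        inputs=[],    # TensorDesc entries omitted for brevity
        outputs=[],
        attrs={"dtype_bytes": dtype_bytes},
    )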

Softmax

Vector ops = 5 * seq_len * seq_len * num_heads
Memory: 2 * seq_len * seq_len * num_heads * dtype_bytes (read + write)

RMSNorm

Vector ops = 3 * batch * seq_len * hidden_size
Memory: 2 * batch * seq_len * hidden_size * dtype_bytes
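
Worked numbers for the two vector-engine formulas above, assuming DeepSeek V3-like dimensions (batch=1, seq_len=4096, num_heads=128, hidden_size=7168, BF16):

batch, seq_len, num_heads, hidden_size, dtype_bytes = 1, 4096, 128, 7168, 2

softmax_vector_ops = 5 * seq_len * seq_len * num_heads                 # ~1.07e10 ops
softmax_bytes      = 2 * seq_len * seq_len * num_heads * dtype_bytes   # ~8.6 GB read + write

rmsnorm_vector_ops = 3 * batch * seq_len * hidden_size                 # ~8.8e7 ops
rmsnorm_bytes      = 2 * batch * seq_len * hidden_size * dtype_bytes   # ~117 MB read + write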

Communication Pattern

A Layer declares its distributed-communication requirements via CommSpec:

@dataclass
class CommSpec:
    pattern: str         # "allreduce", "allgather", "p2p", "all2all"
    tensor_bytes: int    # communication volume in bytes
    participants: int    # number of participating chips
    scope: str           # "tp_group", "dp_group", "ep_group"

The actual communication operators are inserted by the L3 ParallelismPlanner.
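
For example, a tensor-parallel attention or FFN layer would hint that its partial output needs an allreduce across the TP group. A sketch under assumed dimensions (the byte formula and helper are illustrative, not taken from this page):

def tp_output_allreduce(batch: int, seq_len: int, hidden_size: int,
                        tp_size: int, dtype_bytes: int = 2) -> CommSpec:
    # Partial outputs of a row-parallel projection are summed across TP ranks.
    return CommSpec(
        pattern="allreduce",
        tensor_bytes=batch * seq_len * hidden_size * dtype_bytes,
        participants=tp_size,
        scope="tp_group",
    )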