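Perf Model Overall Architecture Design
Overview
perf_model is a performance modeling framework for LLM inference deployment. It uses a six-layer architecture (L0-L5) that covers the complete pipeline from model definition to performance evaluation. Core goals:
- Accurate modeling: aligned with the CHIPMathica methodology, with multi-precision evaluation at the Chip/Core/Lane levels
- Dual-path support: two evaluation strategies, the Math algebraic model (fast) and G5 instruction-level simulation (accurate)
- Type safety: the EvalConfig dataclass is the single configuration entry point, eliminating untyped dict passing
- No silent defaults: an exception is raised whenever a required parameter is missing, so configuration errors are never masked
- Extensibility: the Registry pattern supports adding new operators, chips, and evaluators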
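The Registry pattern named in the last goal can be sketched as below. This is a minimal illustration, not perf_model's actual registry: the `Registry` class, the `OPERATORS` instance, and `MatMulOp` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Name -> factory map. Hypothetical helper, not perf_model's real API."""

    def __init__(self, kind: str) -> None:
        self._kind = kind
        self._entries: Dict[str, Callable[..., T]] = {}

    def register(self, name: str) -> Callable[[Callable[..., T]], Callable[..., T]]:
        """Decorator that registers a class or factory under `name`."""
        def decorator(factory: Callable[..., T]) -> Callable[..., T]:
            if name in self._entries:
                raise ValueError(f"{self._kind} '{name}' is already registered")
            self._entries[name] = factory
            return factory
        return decorator

    def create(self, name: str, **kwargs) -> T:
        """Instantiate a registered entry; unknown names fail loudly."""
        if name not in self._entries:
            raise KeyError(f"unknown {self._kind} '{name}'; known: {sorted(self._entries)}")
        return self._entries[name](**kwargs)


# One registry per extensible kind (operators, chips, evaluators).
OPERATORS: Registry[object] = Registry("operator")


@OPERATORS.register("matmul")
class MatMulOp:
    """Illustrative operator: an (m x k) by (k x n) matrix multiply."""

    def __init__(self, m: int, n: int, k: int) -> None:
        self.m, self.n, self.k = m, n, k

    def flops(self) -> int:
        # One multiply and one add per (m, n, k) triple.
        return 2 * self.m * self.n * self.k
```

With this shape, adding an operator means adding one decorated class. Unknown names raise instead of falling back to a default, in keeping with the "no silent defaults" goal.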
Layered Architecture
Frontend (React + TypeScript + Three.js)
|
| REST API / WebSocket
v
+----- L0: Entry & Orchestration -----+
| api/ / engine.py / eval_config.py |
| config_loader / tasks / websocket |
+--------------------------------------+
|
| EvalConfig (typed dataclass)
v
+----- L1: Workload Modeling ----------+
| DeepSeekV3Model -> WorkloadIR |
| Layer/Op/TensorDesc/ComputeSpec |
+--------------------------------------+
|
| WorkloadIR (Model + Graph)
v
+----- L2: Hardware Architecture ------+
| ChipSpecImpl / TopologySpec |
| Pod / Rack / Board / Chip / Core |
+--------------------------------------+
|
| ChipSpec + TopologySpec
v
+----- L3: Mapping (shared) -----------+
| ParallelismPlanner |
| -> DistributedModel |
| |
| [Math path] [G5 path] |
| TilingPlanner InstructionEmit |
| -> ExecPlan -> CoreProgram |
+--------------------------------------+
| |
v v
+-- L4: Math Eval --+ +-- L4: G5 Eval --+
| EvaluationEngine | | G5SimEngine |
| ChipCostModel | | Event-driven sim |
+-------------------+ +-----------------+
| |
+--------+-----------+
| EngineResult (unified format)
v
+----- L5: Reporting & Analysis ------+
| ReportingEngine -> ReportingReport |
| CostAnalyzer / Gantt / Roofline |
+--------------------------------------+
|
v
Frontend Visualization
Core Data Flow
1. Frontend JSON Request
|
2. L0: run_evaluation(eval_config: EvalConfig)
| - Route to the Math or G5 pipeline according to eval_config.mode
|
3. L1: DeepSeekV3Model.from_model_config(eval_config.model)
| - Build the Layer list (Embedding, MLA, FFN/MoE, LMHead)
| - Generate the WorkloadIR (Model + Graph)
|
4. L2: ChipSpecImpl.from_config(name, chip_config)
| - Load chip microarchitecture parameters (TFLOPS, memory, Cube size, NoC, etc.)
|
5. L3 shared: ParallelismPlanner.plan(ir)
| - Partition stages by PP, shard ops by TP/EP
| - Insert communication operators (AllReduce, AllGather, P2P, All2All)
| -> DistributedModel
|
6a. [Math path]
| L3: TilingPlanner.plan(dist_model) -> TilePlan
| L3: Scheduler.plan(dist_model, tile_plan) -> ExecPlan
| L4: EvaluationEngine.evaluate(exec_plan) -> EngineResult
|
6b. [G5 path]
| L3: G5InstructionEmitter.emit(dist_model) -> CoreProgram
| L4: G5SimEngine.simulate(core_program) -> EngineResult
|
7. L5: ReportingEngine.run(engine_result)
| - Assemble the performance report
| - Compute deployment cost (CostAnalyzer)
| - Generate Gantt/Roofline/memory/traffic analyses
|
8. Return to the frontend -> visualization
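The mode-based routing in step 2 and the fork at steps 6a/6b can be sketched as follows. This is a hedged illustration only: the stage names are taken from the flow above, but the functions `run_math_pipeline`, `run_g5_pipeline`, and `run_evaluation_sketch` are hypothetical stand-ins that record which stages would run rather than calling the real classes.

```python
from typing import Callable, Dict, List


def run_math_pipeline(trace: List[str]) -> None:
    # Stand-in for the Math path (step 6a): tiling, scheduling, algebraic evaluation.
    trace += ["TilingPlanner.plan", "Scheduler.plan", "EvaluationEngine.evaluate"]


def run_g5_pipeline(trace: List[str]) -> None:
    # Stand-in for the G5 path (step 6b): instruction emission, then simulation.
    trace += ["G5InstructionEmitter.emit", "G5SimEngine.simulate"]


# Assumption: "math" and "g5" are the only valid modes, as in the EvalConfig comment.
PIPELINES: Dict[str, Callable[[List[str]], None]] = {
    "math": run_math_pipeline,
    "g5": run_g5_pipeline,
}


def run_evaluation_sketch(mode: str) -> List[str]:
    """Return the ordered list of stages that would run for `mode`."""
    if mode not in PIPELINES:
        # No silent fallback to a default path.
        raise ValueError(f"unknown mode '{mode}'; expected one of {sorted(PIPELINES)}")
    # Steps 3-5 are shared by both paths.
    trace = [
        "DeepSeekV3Model.from_model_config",
        "ChipSpecImpl.from_config",
        "ParallelismPlanner.plan",
    ]
    PIPELINES[mode](trace)
    # Step 7: both paths converge on the same reporting stage.
    trace.append("ReportingEngine.run")
    return trace
```

Both branches share the same prefix and suffix, so only the middle of the trace differs between modes. This mirrors the unified EngineResult format that lets L5 stay path-agnostic.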
Configuration Pipeline (EvalConfig)
Design Principles
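- Single conversion point: the dict -> EvalConfig conversion happens exactly once, in build_eval_config()
- Type safety: the whole pipeline passes the EvalConfig dataclass and no longer passes dict[str, Any]
- No defaults: a ValueError is raised when any required field is missing
- No data loss: topology/comm parameters are injected directly into _build_hardware_spec()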
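The "single conversion point" and "no defaults" principles can be illustrated with one sub-config. This is a sketch under assumptions: `InferenceConfigSketch`, `require`, and `build_inference_config` are hypothetical names, and the three fields are taken from the `InferenceConfig` comment (batch, seq_len, dtype). The real build_eval_config() may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class InferenceConfigSketch:
    """Hypothetical stand-in for InferenceConfig; field names are assumed."""
    batch: int
    seq_len: int
    dtype: str


def require(raw: Mapping[str, Any], key: str, section: str) -> Any:
    """Fetch a required field, raising instead of substituting a default."""
    if key not in raw or raw[key] is None:
        raise ValueError(f"missing required field '{section}.{key}'")
    return raw[key]


def build_inference_config(raw: Mapping[str, Any]) -> InferenceConfigSketch:
    """The only place where the untyped dict is read; callers get a typed object."""
    return InferenceConfigSketch(
        batch=int(require(raw, "batch", "inference")),
        seq_len=int(require(raw, "seq_len", "inference")),
        dtype=str(require(raw, "dtype", "inference")),
    )
```

Calling `build_inference_config({"batch": 8, "seq_len": 4096})` fails with a message naming `inference.dtype`, where a `raw.get("dtype", ...)` call would silently continue with a guessed value.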
EvalConfig Structure
from dataclasses import dataclass

@dataclass
class EvalConfig:
    mode: str                     # "math" | "g5"
    model: ModelConfig            # Model parameters (with nested MLA/MoE configs)
    chip_config: dict             # Raw chip dict (passed to ChipSpecImpl)
    topology: TopologyOverrides   # 4-level bandwidth/latency + comm latency parameters
    comm: CommOverrides           # Communication protocol parameters (bw_utilization, sync_lat)
    deployment: DeploymentConfig  # Parallelism strategy (TP/PP/DP/EP/MoE_TP)
    board: BoardConfig            # Board specification (num_chips, memory)
    inference: InferenceConfig    # Inference parameters (batch, seq_len, dtype)
    raw_model_config: dict        # Kept for reports/snapshots
    raw_topology_config: dict     # Kept for reports/snapshots
Directory Structure
backend/perf_model/
+-- main.py # FastAPI entry point (port 8003)
+-- __init__.py # Public API (lazy imports)
+-- L0_entry/ # Entry layer
| +-- engine.py # Evaluation orchestrator
| +-- eval_config.py # EvalConfig + build_eval_config()
| +-- config_loader.py # YAML preset loading
| +-- config_schema.py # Pydantic validation models
| +-- types.py # DataType, ParallelMode enums
| +-- tasks.py # Task queue management
| +-- websocket.py # WebSocket push
| +-- compat.py # Frontend compatibility layer
| +-- topology_format.py # Topology format conversion
| +-- api/ # REST endpoints (10 submodules)
| +-- storage/database.py # SQLAlchemy ORM
+-- L1_workload/ # Workload modeling layer
| +-- ir.py # WorkloadIR protocol
| +-- layer.py / op.py / tensor.py / specs.py
| +-- models/llm/ # Model implementations (deepseek.py, llama.py)
| +-- layers/ # Layer implementations (MLA, FFN, MoE, ...)
| +-- operators/ # Operator implementations
+-- L2_arch/ # Hardware architecture layer
| +-- chip.py # ChipSpecImpl
| +-- board.py / rack.py / pod.py
| +-- topology.py # TopologySpec + TopologySpecImpl
| +-- core.py / memory.py / interconnect.py / dma.py
+-- L3_mapping/ # Mapping layer
| +-- common/
| | +-- parallelism/ # ParallelismPlanner (shared by both paths)
| | +-- plan/distributed_model.py
| +-- math/ # Math path
| | +-- tiling/ # TilingPlanner
| | +-- scheduling/ # Scheduler
| | +-- plan/exec_plan.py
| +-- g5/ # G5 path
| +-- instruction_emitter.py
| +-- instruction_tiler.py
| +-- program.py # CoreProgram
+-- L4_evaluation/ # Evaluation layer
| +-- common/
| | +-- metrics.py # EngineResult, StepMetrics, Aggregates
| +-- math/ # Math evaluation
| | +-- engine.py # EvaluationEngine
| | +-- evaluators/ # Cost evaluators (compute/comm/precise)
| | +-- cost_models/ # Chip/Core/CommProtocol cost models
| +-- g5/ # G5 evaluation
| +-- pipeline.py # G5 pipeline wrapper
| +-- sim_engine.py # Simulation engine entry point
| +-- adapter.py # SimRecord -> EngineResult adapter
| +-- kernel/ # Simulation kernel (sim_kernel, sim_record, stats)
| +-- chip/ top/ # Chip and top-level modeling
| +-- tiu.py dma.py sdma.py hau.py memory.py
+-- L5_reporting/ # Reporting layer
| +-- engine.py / assembler.py / models.py / schema.py
| +-- cost_analysis.py / memory_analysis.py
| +-- roofline.py / gantt.py / traffic_analysis.py / exporters.py
+-- configs/ # Preset configurations
+-- chips/ # Chip YAML (SG2262.yaml, etc.)
+-- models/ # Model YAML (DeepSeek-V3-*.yaml, etc.)
+-- topologies/ # Topology YAML (P1-R1-B1-C8.yaml, etc.)
+-- benchmarks/ # Test scenario YAML
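Dependency Boundaries
L0 -> L1, L2, L3, L4, L5 (orchestrates the full flow)
L1 -> no external dependencies (pure workload modeling)
L2 -> no external dependencies (pure hardware description)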
L3/common -> L1 (WorkloadIR), L2 (ChipSpec)
L3/math -> L3/common, L4/math (PreciseTileEvaluator for precise evaluation)
L3/g5 -> L3/common
L4/math -> L2 (TopologySpec), L3/math (ExecPlan, DistributedModel)
L4/g5 -> L2, L3/g5 (CoreProgram)
L5 -> L4/common (EngineResult)
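Note: the TilingPlanner in L3/math depends on the PreciseTileEvaluator in L4/math for precise evaluation. This is intentional: during candidate enumeration, TilingPlanner needs L4's precise evaluation capability.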
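The boundaries above lend themselves to a mechanical check, for example in a lint step over import edges. The sketch below transcribes the list literally into a table. `ALLOWED_DEPS` and `is_allowed` are hypothetical names, and the rule that an entry such as "L4" also covers its sub-packages ("L4/math", "L4/g5") is an assumption of this example.

```python
from typing import Dict, Set

# Literal transcription of the dependency list; not generated from the codebase.
ALLOWED_DEPS: Dict[str, Set[str]] = {
    "L0": {"L1", "L2", "L3", "L4", "L5"},
    "L1": set(),
    "L2": set(),
    "L3/common": {"L1", "L2"},
    "L3/math": {"L3/common", "L4/math"},  # intentional upward edge (PreciseTileEvaluator)
    "L3/g5": {"L3/common"},
    "L4/math": {"L2", "L3/math"},
    "L4/g5": {"L2", "L3/g5"},
    "L5": {"L4/common"},
}


def is_allowed(src: str, dst: str) -> bool:
    """Return True if package `src` may import from package `dst`."""
    if src not in ALLOWED_DEPS:
        raise KeyError(f"unknown package '{src}'")
    if src == dst:
        return True  # imports within the same package are always fine
    # Assumption: an allowed entry "L4" also permits "L4/<sub-package>".
    return any(dst == a or dst.startswith(a + "/") for a in ALLOWED_DEPS[src])
```

For instance, `is_allowed("L3/math", "L4/math")` is true because of the documented exception, while `is_allowed("L3/g5", "L4/g5")` is false since no such edge is listed.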
Supported Model Architectures
| Model | Type | Special Architecture | Implementation |
|---|---|---|---|
| DeepSeek V3 | MoE | MLA + MoE (256E/8A) | DeepSeekV3Model |
| DeepSeek R1 | MoE | MLA + MoE | DeepSeekV3Model (distinguished by configuration) |
| Qwen 3-235B | MoE | GQA + MoE | DeepSeekV3Model (adapted) |
| LLaMA | Dense | GQA | LlamaModel |
Supported Parallelism Strategies
| Strategy | Abbreviation | Sharded Dimension | Communication Pattern |
|---|---|---|---|
| Tensor Parallelism | TP | hidden_size | AllReduce / ReduceScatter+AllGather |
| Pipeline Parallelism | PP | num_layers | P2P Send/Recv |
| Data Parallelism | DP | batch_size | AllReduce (gradients) |
| Expert Parallelism | EP | num_experts | All2All (dispatch/combine) |
| Sequence Parallelism | SP | seq_len | AllGather / ReduceScatter |
| MoE TP | MoE_TP | TP within each expert | AllReduce |
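To make the "Sharded Dimension" column concrete, the sketch below computes the per-device share of each dimension for a given set of parallel degrees. It is an illustration under assumptions: `DeploymentSketch`, `shard`, and `per_device_shapes` are hypothetical names, the sizes in the usage line are made-up round numbers rather than a real model config, and requiring every dimension to divide evenly is a simplification that the real planner may not impose.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DeploymentSketch:
    """Hypothetical subset of DeploymentConfig: parallel degrees only."""
    tp: int
    pp: int
    dp: int
    ep: int


def shard(total: int, ways: int, name: str) -> int:
    """Split `total` evenly across `ways` devices, rejecting uneven splits."""
    if ways <= 0:
        raise ValueError(f"{name}: parallel degree must be positive, got {ways}")
    if total % ways != 0:
        raise ValueError(f"{name}: {total} is not divisible by {ways}")
    return total // ways


def per_device_shapes(hidden_size: int, num_layers: int, batch_size: int,
                      num_experts: int, d: DeploymentSketch) -> Dict[str, int]:
    """Map each sharded dimension from the table to its per-device size."""
    return {
        "hidden_size": shard(hidden_size, d.tp, "TP"),   # TP splits hidden_size
        "num_layers": shard(num_layers, d.pp, "PP"),     # PP splits layers into stages
        "batch_size": shard(batch_size, d.dp, "DP"),     # DP splits the batch
        "num_experts": shard(num_experts, d.ep, "EP"),   # EP splits the experts
    }


# Illustrative sizes only.
shapes = per_device_shapes(8192, 64, 32, 256, DeploymentSketch(tp=8, pp=4, dp=2, ep=32))
```

SP and MoE_TP are left out to keep the example short. They would follow the same pattern on seq_len and on the expert-internal dimension.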