L5 Reporting -- 报告与可视化层
功能概述
L5 是评估结果的唯一出口,负责:
- 将 L4
EngineResult汇总为结构化ReportingReport - 成本分析 (服务器 + 互联 + 运营)
- 内存分解 (权重 / KV Cache / 激活)
- Roofline 性能画像 (算力/带宽瓶颈)
- Gantt 时序图 (计算/通信/等待分段)
- 链路流量分析 (c2c/b2b/r2r/p2p 利用率)
- JSON 导出
不在范围: 不做评估计算 (由 L4),不做前端渲染。
模块清单
| 模块 | 职责 |
|---|---|
engine.py | ReportingEngine (统一入口) |
assembler.py | ReportingAssembler (指标汇总) |
models.py | PerformanceSummary, BottleneckSummary, ReportingReport, OutputConfig |
schema.py | SCHEMA_VERSION |
cost_analysis.py | CostAnalyzer, CostBreakdown |
memory_analysis.py | MemoryAnalyzer, MemoryBreakdown |
roofline.py | RooflineAnalyzer, RooflineData, RooflinePoint |
gantt.py | GanttChartBuilder, GanttChartData, GanttTask |
traffic_analysis.py | TrafficAnalyzer, TrafficReport, LinkTraffic |
exporters.py | JSONExporter, ExporterRegistry |
整体架构
L4 EngineResult (StepMetrics + Aggregates)
|
v
ReportingEngine
|
+--------+--------+
v v v
Assembler CostAnalyzer MemoryAnalyzer
(汇总) (成本) (内存)
| | |
v v v
ReportingReport CostBreakdown MemoryBreakdown
|
+--------+--------+
v v v
Roofline Gantt Traffic
(性能) (时序) (流量)
|
v
JSONExporter
ReportingEngine
接口
class ReportingEngine:
def run(
self,
engine_result: EngineResult,
config: dict | None = None,
output_config: OutputConfig | None = None,
) -> ReportingReport
def build_text(self, report: ReportingReport) -> ReportText
def export(
self,
report: ReportingReport,
output_config: OutputConfig | None = None,
filename: str = "reporting_report.json",
) -> str
处理流程
- 输入校验: 检查 EngineResult 包含有效 aggregates
- 指标汇总: 调用 ReportingAssembler.assemble()
- 文本生成: build_text() 生成可读文本报告
- JSON 导出: export() 输出到文件
ReportingAssembler
汇总逻辑
class ReportingAssembler:
def assemble(
self,
engine_result: EngineResult,
config: dict | None = None,
include_step_metrics: bool = True,
) -> ReportingReport
转换流程:
Aggregates->PerformanceSummary(字段映射,添加_ms时间单位后缀)StepMetrics[]->BottleneckSummary(按类型计数 + Top-5 耗时 Op)- 可选附带完整
step_metrics列表
输出结构
@dataclass
class PerformanceSummary:
total_time_ms: float # 对应 Aggregates.total_time
ttft_ms: float # 对应 Aggregates.ttft
tpot_ms: float # 对应 Aggregates.tpot
tps: float
mfu: float
mbu: float
memory_peak_mb: float
compute_time_ms: float # 对应 Aggregates.total_compute_time
comm_time_ms: float # 对应 Aggregates.total_comm_time
wait_time_ms: float
total_flops: int
total_bytes: int
num_ops: int
@dataclass
class BottleneckSummary:
compute_bound_count: int
bw_bound_count: int
latency_bound_count: int
unknown_count: int
top_ops: list[dict] # Top-5 耗时 Op
CostAnalyzer
成本模型
class CostAnalyzer:
def __init__(
self,
chip_prices: dict[str, float] | None = None,
modules_per_server: int = 8,
chips_per_module: int = 1,
depreciation_years: int = 3,
)
def analyze(
self,
chip_type: str,
chip_count: int,
tps: float = 0.0,
lanes_per_chip: int = 16,
) -> CostBreakdown
def analyze_from_pod(
self,
pod: PodSpec,
aggregates: Aggregates | None = None,
) -> CostBreakdown
成本公式
# 服务器成本 (单台)
server_cost = (chip_price * chips_per_module + 750) * modules_per_server + 12000 + 7500
# 互联成本
interconnect_cost = chip_count * modules_per_server * chips_per_module * lanes * lane_cost(chip_count)
# 总成本
total_cost = server_cost + interconnect_cost
# 百万 token 成本 (3 年折旧)
cost_per_million_tokens = total_cost / (depreciation_years * 365 * 24 * 3600 * tps) * 1e6
互联成本阶梯
| 芯片数 | 单 lane 成本 | 互联方案 |
|---|---|---|
| 1-2 | $1/lane | PCIe 直连 |
| 8 | $55/lane | Ethernet 交换 |
| 16 | $70/lane | 交换 + DAC |
| 32 | $70/lane | 交换 + DAC |
| 64 | $105/lane | 交换 + AEC |
| 64+ | $247/lane | 全光方案 (AOC + 光模块) |
CostBreakdown
@dataclass
class CostBreakdown:
server_cost: float
interconnect_cost: float
total_cost: float
cost_per_chip: float
cost_per_million_tokens: float
chip_count: int
chip_type: str
depreciation_years: int
MemoryAnalyzer
内存分解
class MemoryAnalyzer:
def analyze(
self,
hidden_size: int,
num_layers: int,
num_heads: int,
intermediate_size: int,
vocab_size: int,
batch_size: int = 1,
seq_len: int = 1024,
num_kv_heads: int | None = None,
tp_degree: int = 1,
) -> MemoryBreakdown
计算公式
# 权重 (单层)
attention_params = 4 * hidden_size * hidden_size # Q/K/V/O
ffn_params = 3 * hidden_size * intermediate_size # gate/up/down
layer_params = attention_params + ffn_params + 2 * hidden_size # +layernorm
# 总权重
weights_bytes = (layer_params * num_layers + vocab_size * hidden_size) * dtype_bytes / tp_degree
# KV Cache
kv_cache_bytes = 2 * batch * seq_len * kv_heads * head_dim * num_layers * dtype_bytes
# 激活
activations_bytes = batch * seq_len * hidden_size * dtype_bytes * activation_factor
MemoryBreakdown
@dataclass
class MemoryBreakdown:
total_bytes: int
weights_bytes: int
kv_cache_bytes: int
activations_bytes: int
@property
def total_gb(self) -> float
RooflineAnalyzer
Roofline 模型
class RooflineAnalyzer:
def __init__(
self,
peak_flops_gflops: float, # 峰值算力 (GFLOPS)
peak_bandwidth_gbps: float, # 峰值带宽 (GB/s)
)
def analyze_point(
self,
name: str,
flops: int,
bytes_accessed: int,
time_ns: float,
) -> RooflinePoint
关键指标
AI = FLOPS / bytes_accessed # 算术强度
ridge_point = peak_flops / peak_bw # 拐点
attainable = min(peak_flops, AI * peak_bw)
if AI < ridge_point: BW_BOUND
else: COMPUTE_BOUND
RooflineData
@dataclass
class RooflineData:
peak_flops: float
peak_bandwidth: float
ridge_point: float
points: list[RooflinePoint]
roofline_x: list[float] # AI 轴 (log scale)
roofline_y: list[float] # GFLOPS 轴 (log scale)
GanttChartBuilder
Gantt 图生成
class GanttChartBuilder:
def add_task(
self,
name: str,
start_us: float,
end_us: float,
task_type: GanttTaskType,
phase: InferencePhase,
device_id: str = "",
pp_stage: int = 0,
layer_index: int | None = None,
token_index: int | None = None,
**attrs,
) -> GanttTask
def build(self, phase_transition: float | None = None) -> GanttChartData
任务类型
| 类别 | 类型 | 颜色 |
|---|---|---|
| 计算 | compute, attention, ffn | 绿色 (#52c41a - #237804) |
| MLA | mla_q_proj, mla_kv_proj, ... | 青色 (#13c2c2 - #08979c) |
| MoE | moe_router, moe_expert, ... | 品红 (#f759ab - #c41d7f) |
| 通信 | tp_comm, pp_comm, ep_comm | 蓝/紫 (#1890ff - #722ed1) |
| 内存 | hbm_read, hbm_write, kv_cache | 橙色 (#faad14 - #d48806) |
| 空闲 | bubble, idle | 灰色 (#d9d9d9) |
输出格式
@dataclass
class GanttChartData:
resources: list[dict] # PP stage 资源行 (compute + network)
tasks: list[GanttTask] # 任务列表
time_range: dict # {"start": 0, "end": us}
phase_transition: float # TTFT (us), prefill/decode 分界
TrafficAnalyzer
class TrafficAnalyzer:
def add_comm(
self,
src: str,
dst: str,
bytes_transferred: int,
time_us: float,
comm_type: str,
phase: str,
) -> None
def analyze(self, exec_plan: ExecPlan) -> TrafficReport
分析维度:
- 链路流量: src -> dst 的传输量与利用率
- 设备流量: 每芯片 send/recv 分解
- 通信类型分解: TP_ALLREDUCE / PP_P2P / EP_ALLTOALL / ...
- 阶段分解: prefill vs decode
L0 Compat 层
L0_entry/compat.py 负责将 L4/L5 结果转为前端兼容格式:
convert_to_gantt_chart()
def convert_to_gantt_chart(
step_metrics: list[dict],
parallelism: dict,
aggregates: dict | None = None,
topology_config: dict | None = None,
) -> dict
输出格式 (前端 Gantt 组件):
{
"resources": [
{"id": "stage0_compute", "name": "PP0 Compute", "ppStage": 0, "type": "compute"},
{"id": "stage0_network", "name": "PP0 Network", "ppStage": 0, "type": "network"}
],
"tasks": [
{
"id": "task_1",
"name": "layers.5.mla",
"resource": "stage0_compute",
"start": 1000.0,
"end": 5000.0,
"type": "attention_qkv",
"phase": "prefill",
"color": "#389e0d"
}
],
"timeRange": {"start": 0.0, "end": 50000.0},
"phaseTransition": 20000.0
}
convert_to_stats()
输出格式 (前端性能面板):
{
"prefill": {"computeTime": 120, "commTime": 25, "totalTime": 150},
"decode": {"computeTime": 80, "commTime": 15, "totalTime": 100},
"totalRunTime": 250,
"ttft": 150,
"avgTpot": 2.3,
"dynamicMfu": 0.65,
"dynamicMbu": 0.72,
"linkTrafficStats": [...]
}
JSON 导出
ReportingReport 通过 dataclasses.asdict() 序列化为 JSON,格式如下:
{
"schema_version": "1.0.0",
"timestamp": "2026-02-25T12:00:00",
"granularity": "CHIP",
"performance": {
"total_time_ms": 150.5,
"ttft_ms": 45.2,
"tpot_ms": 2.3,
"tps": 434.78,
"mfu": 0.65,
"mbu": 0.72,
"memory_peak_mb": 64512.0,
"compute_time_ms": 120.0,
"comm_time_ms": 25.0,
"wait_time_ms": 5.5,
"total_flops": 12345678900,
"total_bytes": 987654321,
"num_ops": 512
},
"bottleneck": {
"compute_bound_count": 200,
"bw_bound_count": 150,
"latency_bound_count": 50,
"unknown_count": 112,
"top_ops": [
{"op_id": "layers.5.mla", "t_total_ms": 12.34, "bottleneck": "BW_BOUND"}
]
},
"config": {},
"step_metrics": []
}