
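L5 Reporting -- Reporting and Visualization Layer

Overview

L5 is the single exit point for evaluation results. It is responsible for:

  • Aggregating the L4 EngineResult into a structured ReportingReport
  • Cost analysis (servers + interconnect + operations)
  • Memory breakdown (weights / KV cache / activations)
  • Roofline performance profiling (compute-bound vs. bandwidth-bound)
  • Gantt timeline charts (compute / communication / wait segments)
  • Link traffic analysis (c2c/b2b/r2r/p2p utilization)
  • JSON export

Out of scope: L5 performs no evaluation computation (that is L4's job) and no front-end rendering.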

Modules

Module               Responsibility
engine.py            ReportingEngine (unified entry point)
assembler.py         ReportingAssembler (metric aggregation)
models.py            PerformanceSummary, BottleneckSummary, ReportingReport, OutputConfig
schema.py            SCHEMA_VERSION
cost_analysis.py     CostAnalyzer, CostBreakdown
memory_analysis.py   MemoryAnalyzer, MemoryBreakdown
roofline.py          RooflineAnalyzer, RooflineData, RooflinePoint
gantt.py             GanttChartBuilder, GanttChartData, GanttTask
traffic_analysis.py  TrafficAnalyzer, TrafficReport, LinkTraffic
exporters.py         JSONExporter, ExporterRegistry

Architecture

     L4 EngineResult (StepMetrics + Aggregates)
                          |
                          v
                   ReportingEngine
                          |
        +-----------------+-----------------+
        v                 v                 v
    Assembler       CostAnalyzer     MemoryAnalyzer
    (summary)          (cost)           (memory)
        |                 |                 |
        v                 v                 v
 ReportingReport    CostBreakdown    MemoryBreakdown
                          |
        +-----------------+-----------------+
        v                 v                 v
    Roofline            Gantt            Traffic
  (performance)      (timeline)         (traffic)
                          |
                          v
                    JSONExporter

ReportingEngine

Interface

class ReportingEngine:
    def run(
        self,
        engine_result: EngineResult,
        config: dict | None = None,
        output_config: OutputConfig | None = None,
    ) -> ReportingReport: ...

    def build_text(self, report: ReportingReport) -> ReportText: ...

    def export(
        self,
        report: ReportingReport,
        output_config: OutputConfig | None = None,
        filename: str = "reporting_report.json",
    ) -> str: ...

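Processing flow

  1. Input validation: check that the EngineResult contains valid aggregates
  2. Metric aggregation: call ReportingAssembler.assemble()
  3. Text generation: build_text() produces a human-readable text report
  4. JSON export: export() writes the report to a file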

ReportingAssembler

Aggregation logic

class ReportingAssembler:
    def assemble(
        self,
        engine_result: EngineResult,
        config: dict | None = None,
        include_step_metrics: bool = True,
    ) -> ReportingReport: ...

Conversion steps:

  1. Aggregates -> PerformanceSummary (field mapping; time fields gain the _ms unit suffix)
  2. StepMetrics[] -> BottleneckSummary (count per bottleneck type + the top 5 ops by time)
  3. Optionally attach the full step_metrics list
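The second conversion step can be sketched as follows. This is a minimal illustration, not the actual ReportingAssembler code. It assumes each step metric is a dict carrying the `op_id`, `t_total_ms`, and `bottleneck` keys that appear in the JSON export example; the `LATENCY_BOUND` label and the helper name `summarize_bottlenecks` are hypothetical.

```python
from collections import Counter


def summarize_bottlenecks(step_metrics: list[dict], top_n: int = 5) -> dict:
    """Count ops per bottleneck type and keep the top_n slowest ops."""
    # Map bottleneck labels to BottleneckSummary field names.
    known = {
        "COMPUTE_BOUND": "compute_bound_count",
        "BW_BOUND": "bw_bound_count",
        "LATENCY_BOUND": "latency_bound_count",  # assumed label
    }
    counts = Counter()
    for step in step_metrics:
        # Anything without a recognized label is counted as unknown.
        counts[known.get(step.get("bottleneck"), "unknown_count")] += 1

    # Sort by total time, slowest first, and keep the leading top_n entries.
    slowest = sorted(step_metrics, key=lambda s: s["t_total_ms"], reverse=True)
    return {
        "compute_bound_count": counts["compute_bound_count"],
        "bw_bound_count": counts["bw_bound_count"],
        "latency_bound_count": counts["latency_bound_count"],
        "unknown_count": counts["unknown_count"],
        "top_ops": slowest[:top_n],
    }


steps = [
    {"op_id": "layers.0.ffn", "t_total_ms": 3.0, "bottleneck": "COMPUTE_BOUND"},
    {"op_id": "layers.5.mla", "t_total_ms": 12.34, "bottleneck": "BW_BOUND"},
    {"op_id": "layers.1.norm", "t_total_ms": 0.2, "bottleneck": None},
]
summary = summarize_bottlenecks(steps)
print(summary["top_ops"][0]["op_id"])  # layers.5.mla
```

With this scheme the four counters always sum to the number of ops, which matches the JSON export example (200 + 150 + 50 + 112 = 512 = num_ops).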

Output structure

@dataclass
class PerformanceSummary:
    total_time_ms: float     # from Aggregates.total_time
    ttft_ms: float           # from Aggregates.ttft
    tpot_ms: float           # from Aggregates.tpot
    tps: float
    mfu: float
    mbu: float
    memory_peak_mb: float
    compute_time_ms: float   # from Aggregates.total_compute_time
    comm_time_ms: float      # from Aggregates.total_comm_time
    wait_time_ms: float
    total_flops: int
    total_bytes: int
    num_ops: int

@dataclass
class BottleneckSummary:
    compute_bound_count: int
    bw_bound_count: int
    latency_bound_count: int
    unknown_count: int
    top_ops: list[dict]      # top 5 ops by time

CostAnalyzer

Cost model

class CostAnalyzer:
    def __init__(
        self,
        chip_prices: dict[str, float] | None = None,
        modules_per_server: int = 8,
        chips_per_module: int = 1,
        depreciation_years: int = 3,
    ): ...

    def analyze(
        self,
        chip_type: str,
        chip_count: int,
        tps: float = 0.0,
        lanes_per_chip: int = 16,
    ) -> CostBreakdown: ...

    def analyze_from_pod(
        self,
        pod: PodSpec,
        aggregates: Aggregates | None = None,
    ) -> CostBreakdown: ...

Cost formulas

# Server cost (single server)
server_cost = (chip_price * chips_per_module + 750) * modules_per_server + 12000 + 7500

# Interconnect cost
interconnect_cost = chip_count * modules_per_server * chips_per_module * lanes * lane_cost(chip_count)

# Total cost
total_cost = server_cost + interconnect_cost

# Cost per million tokens (3-year depreciation by default; requires tps > 0)
cost_per_million_tokens = total_cost / (depreciation_years * 365 * 24 * 3600 * tps) * 1e6
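To make the arithmetic concrete, the formulas above can be transcribed into a small function. This is a sketch under stated assumptions, not the CostAnalyzer implementation: the function name `estimate_cost`, the chip price, and the choice to return 0.0 when `tps` is not positive are all hypothetical, and `lane_cost` is passed in as a plain number.

```python
def estimate_cost(
    chip_price: float,
    chip_count: int,
    lane_cost: float,
    tps: float,
    lanes: int = 16,
    modules_per_server: int = 8,
    chips_per_module: int = 1,
    depreciation_years: int = 3,
) -> dict:
    """Evaluate the cost formulas exactly as written in this section."""
    # Cost of a single server.
    server_cost = (chip_price * chips_per_module + 750) * modules_per_server + 12000 + 7500
    # Interconnect cost; lane_cost is the per-lane price for this chip count.
    interconnect_cost = chip_count * modules_per_server * chips_per_module * lanes * lane_cost
    total_cost = server_cost + interconnect_cost

    # Assumption: report 0.0 when no throughput is known, to avoid dividing by zero.
    if tps > 0:
        lifetime_seconds = depreciation_years * 365 * 24 * 3600
        cost_per_million_tokens = total_cost / (lifetime_seconds * tps) * 1e6
    else:
        cost_per_million_tokens = 0.0

    return {
        "server_cost": server_cost,
        "interconnect_cost": interconnect_cost,
        "total_cost": total_cost,
        "cost_per_million_tokens": cost_per_million_tokens,
    }


# Made-up inputs: a $10,000 chip, 8 chips, $55 per lane, 1000 tokens/s.
result = estimate_cost(chip_price=10_000.0, chip_count=8, lane_cost=55.0, tps=1000.0)
print(result["total_cost"])                          # 161820.0
print(round(result["cost_per_million_tokens"], 4))   # 1.7104
```

Because `analyze()` defaults to `tps = 0.0`, any implementation of the last formula needs some guard of this kind; the one shown here is only one possible choice.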

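Interconnect cost tiers

Chip count   Cost per lane   Interconnect scheme
1-2          $1/lane         PCIe direct attach
8            $55/lane        Ethernet switching
16           $70/lane        Switching + DAC
32           $70/lane        Switching + DAC
64           $105/lane       Switching + AEC
64+          $247/lane       All-optical (AOC + optical modules)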
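The tier table can be read as a step function of the chip count. The sketch below is one possible reading, not the actual `lane_cost` implementation. The table does not say how counts between tiers (for example 4 or 12 chips) are priced, so rounding up to the next listed tier is an assumption, and the constant names are hypothetical.

```python
# (upper bound on chip count, USD per lane), copied from the tier table.
LANE_COST_TIERS = [(2, 1.0), (8, 55.0), (16, 70.0), (32, 70.0), (64, 105.0)]
ALL_OPTICAL_LANE_COST = 247.0  # tier for more than 64 chips


def lane_cost(chip_count: int) -> float:
    """Return the per-lane cost for a deployment of chip_count chips."""
    if chip_count < 1:
        raise ValueError("chip_count must be at least 1")
    # Assumption: a count between two tiers is priced at the next tier up.
    for upper_bound, cost in LANE_COST_TIERS:
        if chip_count <= upper_bound:
            return cost
    return ALL_OPTICAL_LANE_COST


for n in (2, 4, 8, 32, 64, 128):
    print(n, lane_cost(n))
```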

CostBreakdown

@dataclass
class CostBreakdown:
    server_cost: float
    interconnect_cost: float
    total_cost: float
    cost_per_chip: float
    cost_per_million_tokens: float
    chip_count: int
    chip_type: str
    depreciation_years: int

MemoryAnalyzer

Memory breakdown

class MemoryAnalyzer:
    def analyze(
        self,
        hidden_size: int,
        num_layers: int,
        num_heads: int,
        intermediate_size: int,
        vocab_size: int,
        batch_size: int = 1,
        seq_len: int = 1024,
        num_kv_heads: int | None = None,
        tp_degree: int = 1,
    ) -> MemoryBreakdown: ...

Formulas

# Weights (single layer)
attention_params = 4 * hidden_size * hidden_size      # Q/K/V/O
ffn_params = 3 * hidden_size * intermediate_size      # gate/up/down
layer_params = attention_params + ffn_params + 2 * hidden_size  # + layernorm

# Total weights
weights_bytes = (layer_params * num_layers + vocab_size * hidden_size) * dtype_bytes / tp_degree

# KV cache
kv_cache_bytes = 2 * batch * seq_len * kv_heads * head_dim * num_layers * dtype_bytes

# Activations
activations_bytes = batch * seq_len * hidden_size * dtype_bytes * activation_factor

MemoryBreakdown

@dataclass
class MemoryBreakdown:
    total_bytes: int
    weights_bytes: int
    kv_cache_bytes: int
    activations_bytes: int

    @property
    def total_gb(self) -> float: ...

RooflineAnalyzer

Roofline model

class RooflineAnalyzer:
    def __init__(
        self,
        peak_flops_gflops: float,     # peak compute (GFLOPS)
        peak_bandwidth_gbps: float,   # peak bandwidth (GB/s)
    ): ...

    def analyze_point(
        self,
        name: str,
        flops: int,
        bytes_accessed: int,
        time_ns: float,
    ) -> RooflinePoint: ...

Key metrics

AI = flops / bytes_accessed           # arithmetic intensity (FLOP/byte)
ridge_point = peak_flops / peak_bw    # ridge point
attainable = min(peak_flops, AI * peak_bw)

if AI < ridge_point:
    BW_BOUND
else:
    COMPUTE_BOUND
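The classification rule above can be exercised with a few lines of Python. This is a sketch rather than the RooflineAnalyzer implementation: the function name `classify_op` and the returned dict are hypothetical, and computing achieved throughput as `flops / time_ns` is an assumption about how `time_ns` is used (it relies on the fact that one FLOP per nanosecond equals one GFLOP/s).

```python
def classify_op(
    flops: int,
    bytes_accessed: int,
    time_ns: float,
    peak_flops_gflops: float,
    peak_bandwidth_gbps: float,
) -> dict:
    """Place one op on the roofline and classify its bottleneck."""
    ai = flops / bytes_accessed                            # FLOP/byte
    ridge_point = peak_flops_gflops / peak_bandwidth_gbps  # FLOP/byte
    # GB/s times FLOP/byte gives GFLOP/s, so both arguments share a unit.
    attainable = min(peak_flops_gflops, ai * peak_bandwidth_gbps)
    achieved = flops / time_ns                             # GFLOP/s (assumed use of time_ns)
    bound = "BW_BOUND" if ai < ridge_point else "COMPUTE_BOUND"
    return {
        "ai": ai,
        "attainable_gflops": attainable,
        "achieved_gflops": achieved,
        "bottleneck": bound,
    }


# Hypothetical device: 1000 GFLOPS peak, 100 GB/s, so the ridge point is 10 FLOP/byte.
low_ai = classify_op(2_000_000, 1_000_000, 20_000.0, 1000.0, 100.0)
high_ai = classify_op(50_000_000, 1_000_000, 100_000.0, 1000.0, 100.0)
print(low_ai["bottleneck"], low_ai["attainable_gflops"])    # BW_BOUND 200.0
print(high_ai["bottleneck"], high_ai["attainable_gflops"])  # COMPUTE_BOUND 1000.0
```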

RooflineData

@dataclass
class RooflineData:
    peak_flops: float
    peak_bandwidth: float
    ridge_point: float
    points: list[RooflinePoint]
    roofline_x: list[float]   # AI axis (log scale)
    roofline_y: list[float]   # GFLOPS axis (log scale)

GanttChartBuilder

Gantt chart construction

class GanttChartBuilder:
    def add_task(
        self,
        name: str,
        start_us: float,
        end_us: float,
        task_type: GanttTaskType,
        phase: InferencePhase,
        device_id: str = "",
        pp_stage: int = 0,
        layer_index: int | None = None,
        token_index: int | None = None,
        **attrs,
    ) -> GanttTask: ...

    def build(self, phase_transition: float | None = None) -> GanttChartData: ...

Task types

Category        Types                           Color
Compute         compute, attention, ffn         Green (#52c41a - #237804)
MLA             mla_q_proj, mla_kv_proj, ...    Cyan (#13c2c2 - #08979c)
MoE             moe_router, moe_expert, ...     Magenta (#f759ab - #c41d7f)
Communication   tp_comm, pp_comm, ep_comm       Blue/purple (#1890ff - #722ed1)
Memory          hbm_read, hbm_write, kv_cache   Orange (#faad14 - #d48806)
Idle            bubble, idle                    Gray (#d9d9d9)

Output format

@dataclass
class GanttChartData:
    resources: list[dict]      # PP stage resource rows (compute + network)
    tasks: list[GanttTask]     # task list
    time_range: dict           # {"start": 0, "end": us}
    phase_transition: float    # TTFT (us), the prefill/decode boundary

TrafficAnalyzer

class TrafficAnalyzer:
    def add_comm(
        self,
        src: str,
        dst: str,
        bytes_transferred: int,
        time_us: float,
        comm_type: str,
        phase: str,
    ) -> None: ...

    def analyze(self, exec_plan: ExecPlan) -> TrafficReport: ...

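Analysis dimensions:

  • Link traffic: volume transferred and utilization for each src -> dst link
  • Device traffic: per-chip send/recv breakdown
  • Breakdown by communication type: TP_ALLREDUCE / PP_P2P / EP_ALLTOALL / ...
  • Breakdown by phase: prefill vs. decode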
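The first two dimensions can be illustrated with a small aggregation over recorded transfers. This is a sketch only, not the TrafficAnalyzer implementation. The section does not define utilization, so the sketch assumes it is achieved bandwidth (total bytes over total transfer time) divided by a peak link bandwidth, with 1 GB = 1e9 bytes; `summarize_traffic` and `link_bandwidth_gbps` are hypothetical names. Summing `time_us` also assumes transfers on the same link do not overlap.

```python
from collections import defaultdict


def summarize_traffic(transfers: list[dict], link_bandwidth_gbps: float) -> dict:
    """Aggregate per-link volume/utilization and per-device send/recv bytes."""
    link_bytes = defaultdict(int)
    link_time_us = defaultdict(float)
    sent = defaultdict(int)
    received = defaultdict(int)

    for t in transfers:
        link = (t["src"], t["dst"])
        link_bytes[link] += t["bytes_transferred"]
        link_time_us[link] += t["time_us"]
        sent[t["src"]] += t["bytes_transferred"]
        received[t["dst"]] += t["bytes_transferred"]

    links = {}
    for link, nbytes in link_bytes.items():
        busy_us = link_time_us[link]
        # 1 byte/us = 1e6 bytes/s = 1e-3 GB/s.
        achieved_gbps = nbytes / busy_us / 1e3 if busy_us > 0 else 0.0
        links[link] = {
            "bytes": nbytes,
            "utilization": achieved_gbps / link_bandwidth_gbps,
        }
    return {"links": links, "sent": dict(sent), "received": dict(received)}


transfers = [
    {"src": "chip0", "dst": "chip1", "bytes_transferred": 600_000, "time_us": 6.0},
    {"src": "chip0", "dst": "chip1", "bytes_transferred": 400_000, "time_us": 4.0},
    {"src": "chip1", "dst": "chip0", "bytes_transferred": 200_000, "time_us": 4.0},
]
report = summarize_traffic(transfers, link_bandwidth_gbps=200.0)
print(report["links"][("chip0", "chip1")])
```

The same loop extends to the other two dimensions by keying additional counters on `comm_type` and `phase`.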

L0 Compat Layer

L0_entry/compat.py converts L4/L5 results into a front-end-compatible format:

convert_to_gantt_chart()

def convert_to_gantt_chart(
    step_metrics: list[dict],
    parallelism: dict,
    aggregates: dict | None = None,
    topology_config: dict | None = None,
) -> dict: ...

Output format (front-end Gantt component):

{
  "resources": [
    {"id": "stage0_compute", "name": "PP0 Compute", "ppStage": 0, "type": "compute"},
    {"id": "stage0_network", "name": "PP0 Network", "ppStage": 0, "type": "network"}
  ],
  "tasks": [
    {
      "id": "task_1",
      "name": "layers.5.mla",
      "resource": "stage0_compute",
      "start": 1000.0,
      "end": 5000.0,
      "type": "attention_qkv",
      "phase": "prefill",
      "color": "#389e0d"
    }
  ],
  "timeRange": {"start": 0.0, "end": 50000.0},
  "phaseTransition": 20000.0
}
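The `resources` array in the example follows a regular pattern: one compute row and one network row per pipeline stage. A sketch of generating it is shown below. The helper name `build_resources` and the `pp_size` parameter are hypothetical, and the naming scheme is inferred from the single-stage example above rather than taken from compat.py.

```python
def build_resources(pp_size: int) -> list[dict]:
    """Build one compute row and one network row per pipeline stage."""
    resources = []
    for stage in range(pp_size):
        for kind in ("compute", "network"):
            resources.append(
                {
                    "id": f"stage{stage}_{kind}",
                    "name": f"PP{stage} {kind.capitalize()}",
                    "ppStage": stage,
                    "type": kind,
                }
            )
    return resources


rows = build_resources(2)
print([row["id"] for row in rows])
# ['stage0_compute', 'stage0_network', 'stage1_compute', 'stage1_network']
```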

convert_to_stats()

Output format (front-end performance panel):

{
  "prefill": {"computeTime": 120, "commTime": 25, "totalTime": 150},
  "decode": {"computeTime": 80, "commTime": 15, "totalTime": 100},
  "totalRunTime": 250,
  "ttft": 150,
  "avgTpot": 2.3,
  "dynamicMfu": 0.65,
  "dynamicMbu": 0.72,
  "linkTrafficStats": [...]
}

JSON Export

ReportingReport is serialized to JSON via dataclasses.asdict(), in the following format:

{
  "schema_version": "1.0.0",
  "timestamp": "2026-02-25T12:00:00",
  "granularity": "CHIP",
  "performance": {
    "total_time_ms": 150.5,
    "ttft_ms": 45.2,
    "tpot_ms": 2.3,
    "tps": 434.78,
    "mfu": 0.65,
    "mbu": 0.72,
    "memory_peak_mb": 64512.0,
    "compute_time_ms": 120.0,
    "comm_time_ms": 25.0,
    "wait_time_ms": 5.5,
    "total_flops": 12345678900,
    "total_bytes": 987654321,
    "num_ops": 512
  },
  "bottleneck": {
    "compute_bound_count": 200,
    "bw_bound_count": 150,
    "latency_bound_count": 50,
    "unknown_count": 112,
    "top_ops": [
      {"op_id": "layers.5.mla", "t_total_ms": 12.34, "bottleneck": "BW_BOUND"}
    ]
  },
  "config": {},
  "step_metrics": []
}
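The serialization path can be sketched with the standard library alone. The two dataclasses below are trimmed, hypothetical stand-ins for PerformanceSummary and ReportingReport, not the real models; the point is only that `dataclasses.asdict()` recurses into nested dataclasses, so the result can be passed straight to `json.dumps()`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Performance:
    """Trimmed stand-in for PerformanceSummary."""
    total_time_ms: float
    ttft_ms: float


@dataclass
class Report:
    """Trimmed stand-in for ReportingReport."""
    schema_version: str
    performance: Performance
    config: dict = field(default_factory=dict)
    step_metrics: list = field(default_factory=list)


report = Report(schema_version="1.0.0", performance=Performance(150.5, 45.2))

# asdict() converts the nested Performance instance into a plain dict as well.
payload = json.dumps(asdict(report), indent=2)
print(payload)
```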