
L2 Arch -- Hardware Architecture Layer

Overview

L2 defines a five-level hardware hierarchy, from Pod (cluster) down to Core (compute core):

Pod (cluster)
+-- Rack (cabinet)
    +-- Board (node/server)
        +-- Chip (accelerator)
            +-- Core (compute core)
                +-- Compute Units (Cube, Vector)

Out of scope: performance evaluation and parallel partitioning.

Module List

Module            Responsibility
chip.py           ChipSpecImpl (peak compute, memory, microarchitecture)
core.py           CoreSpecImpl (compute core specification)
compute.py        ComputeSpec (Cube/Vector units)
memory.py         MemoryHierarchyImpl (GMEM, LMEM)
interconnect.py   InterconnectSpecImpl (NoC, c2c)
dma.py            DMAEngineImpl (GDMA, SDMA)
board.py          BoardSpecImpl (N chips per board)
rack.py           RackSpecImpl (M boards per rack)
pod.py            PodSpecImpl (K racks per pod)
topology.py       TopologySpec (four-level communication parameters) + TopologySpecImpl
protocols.py      Hardware abstraction protocols

ChipSpecImpl

Core capabilities

class ChipSpecImpl:
    name: str                           # "SG2262"
    core_count: int                     # 64
    frequency_ghz: float                # 1.5
    tiu_frequency_ghz: float            # 1.0 (the TIU runs on its own clock)
    interconnect: InterconnectSpecImpl  # NoC specification

    def get_peak_flops(self, dtype: str, unit: str) -> float:
        """Peak compute (FLOPS)
        dtype: "BF16", "FP8", "INT8", ...
        unit: "cube" | "vector"
        """

    def get_gmem_bandwidth(self) -> float:
        """Effective global memory bandwidth (GB/s) = bandwidth_gbps * utilization"""

    def get_total_sram(self) -> int:
        """Total SRAM capacity (bytes) = core_count * lmem_capacity_kb * 1024"""

    @classmethod
    def from_config(cls, name, config: dict) -> "ChipSpecImpl":
        """Build from a YAML config; raises ValueError if any field is missing"""
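The two derived quantities named in the docstrings above reduce to simple arithmetic. The following is a minimal sketch, assuming standalone helper functions (the names `gmem_effective_bandwidth` and `total_sram_bytes` are hypothetical and not part of `ChipSpecImpl`) and the SG2262 values from the chip configuration example:

```python
def gmem_effective_bandwidth(bandwidth_gbps: float, utilization: float) -> float:
    """Effective global memory bandwidth in GB/s (hypothetical helper)."""
    return bandwidth_gbps * utilization


def total_sram_bytes(core_count: int, lmem_capacity_kb: int) -> int:
    """Total on-chip SRAM in bytes: per-core LMEM summed over all cores."""
    return core_count * lmem_capacity_kb * 1024


# SG2262 values taken from the chip configuration example
gmem_bw = gmem_effective_bandwidth(bandwidth_gbps=8601.6, utilization=1.0)
sram = total_sram_bytes(core_count=64, lmem_capacity_kb=2048)

print(gmem_bw)            # 8601.6
print(sram)               # 134217728
print(sram // (1 << 20))  # 128 (MiB)
```

With 64 cores and 2048 KB of LMEM per core, the chip-wide SRAM comes to 128 MiB.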

Chip configuration format (YAML)

Using SG2262 as an example (see 07-configs.md for the full format):

name: SG2262
architecture: TPU_V7
process: 7nm
frequency_ghz: 1.5
tiu_frequency_ghz: 1.0
align_bytes: 32
compute_dma_overlap_rate: 0.8 # compute/DMA overlap ratio
compute_efficiency: 0.9 # compute efficiency

cores:
  count: 64 # number of cores
  lanes_per_core: 16 # SIMD lanes per core

compute_units:
  cube: # GEMM matrix compute unit
    m: 16
    k: 32
    n: 8
    mac_per_lane: { FP8: 256, BF16: 128, INT8: 256, TF32: 64, INT4: 512 }
  vector: # vector compute unit
    eu_per_lane: { BF16: 32, FP16: 32, FP32: 16, INT8: 64 }

memory:
  gmem: # global memory (off-chip DRAM)
    type: LPDDR5
    capacity_gb: 128
    bandwidth_gbps: 8601.6
    utilization: 1.0 # effective bandwidth = bandwidth_gbps * utilization
    latency_ns: 100
  lmem: # per-core local SRAM (capacity is per core)
    capacity_kb: 2048
    bandwidth_gbps: 2000
    latency_ns: 1
    utilization: 0.45 # fraction of SRAM available for tiling

dma_engines:
  gdma: # gmem <-> lmem transfers
    bandwidth_gbps: 68
    startup_latency_ns: 100
    efficiency: 0.9
  sdma: # shared DMA (chip-to-chip transfers)
    bandwidth_gbps: 64
    startup_latency_ns: 120
    efficiency: 0.85

hau: # hardware acceleration unit (TopK/sorting)
  sort_width: 16
  compare_cycles: 1
  init_cycles: 20

noc: # on-chip network (core-to-core communication)
  topology: Mesh
  mesh_cols: 8
  mesh_rows: 8
  bandwidth_gbps: 1000
  latency_ns: 10
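Since `from_config` is described as raising `ValueError` on missing fields rather than falling back to defaults, a strict lookup helper is one way to enforce that. This is a sketch under assumptions: the `require` function and its dotted-path convention are hypothetical illustrations, not the project's actual loader, and the dict stands in for the parsed YAML.

```python
from typing import Any


def require(config: dict[str, Any], dotted_key: str) -> Any:
    """Return config[a][b][c] for "a.b.c", raising ValueError if any part is missing.

    Hypothetical helper: no default values are ever substituted.
    """
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise ValueError(f"missing required chip config field: {dotted_key}")
        node = node[part]
    return node


# A fragment of the SG2262 config as it might look after YAML parsing
config = {
    "name": "SG2262",
    "tiu_frequency_ghz": 1.0,
    "cores": {"count": 64, "lanes_per_core": 16},
    "memory": {"lmem": {"capacity_kb": 2048}},
}

core_count = require(config, "cores.count")
lmem_kb = require(config, "memory.lmem.capacity_kb")

try:
    require(config, "memory.gmem.bandwidth_gbps")  # absent in this fragment
    missing_error = None
except ValueError as exc:
    missing_error = str(exc)

print(core_count, lmem_kb)
print(missing_error)
```

Failing loudly at load time keeps a misspelled or omitted key from silently turning into a wrong performance number later.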

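Peak compute calculation

The trailing factor of 2 counts each multiply-accumulate (MAC) as two operations, one multiply and one add.

peak_flops = cores * lanes_per_core * mac_per_lane[dtype] * tiu_frequency_ghz * 1e9 * 2

Example (SG2262, BF16, Cube):
    = 64 * 16 * 128 * 1.0 * 1e9 * 2
    = ~262 TFLOPS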
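The formula above can be written as a small function. This is a minimal sketch: the function name `cube_peak_flops` and the plain-dict `mac_per_lane` table are illustrative assumptions, with values copied from the SG2262 configuration.

```python
# MACs per lane per cycle for the Cube unit, from the SG2262 config
MAC_PER_LANE = {"FP8": 256, "BF16": 128, "INT8": 256, "TF32": 64, "INT4": 512}


def cube_peak_flops(
    cores: int,
    lanes_per_core: int,
    mac_per_lane: dict[str, int],
    tiu_frequency_ghz: float,
    dtype: str,
) -> float:
    """Peak Cube throughput in operations per second (hypothetical helper)."""
    macs_per_cycle = cores * lanes_per_core * mac_per_lane[dtype]
    # 1e9 converts GHz to Hz; the factor 2 counts a MAC as multiply + add
    return macs_per_cycle * tiu_frequency_ghz * 1e9 * 2


bf16 = cube_peak_flops(64, 16, MAC_PER_LANE, 1.0, "BF16")
fp8 = cube_peak_flops(64, 16, MAC_PER_LANE, 1.0, "FP8")

print(f"BF16: {bf16 / 1e12:.3f} TFLOPS")  # BF16: 262.144 TFLOPS
print(f"FP8:  {fp8 / 1e12:.3f} TFLOPS")   # FP8:  524.288 TFLOPS
```

Note that the calculation uses `tiu_frequency_ghz` (1.0), not the chip-level `frequency_ghz` (1.5), as in the formula above.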

TopologySpec

Four-level communication parameters

@dataclass
class TopologySpec:
    # Chip-to-Chip (within the same board)
    c2c_bandwidth_gbps: float  # e.g. 400 GB/s
    c2c_latency_us: float      # e.g. 0.15 us

    # Board-to-Board (within the same rack)
    b2b_bandwidth_gbps: float  # e.g. 400 GB/s
    b2b_latency_us: float      # e.g. 2.0 us

    # Rack-to-Rack (within the same pod)
    r2r_bandwidth_gbps: float  # e.g. 400 GB/s (InfiniBand)
    r2r_latency_us: float      # e.g. 3.0 us

    # Pod-to-Pod (across pods)
    p2p_bandwidth_gbps: float  # e.g. 400 GB/s (Ethernet)
    p2p_latency_us: float      # e.g. 5.0 us

    # Additional latency parameters (injected from comm_params in the topology YAML)
    memory_read_latency_us: float
    memory_write_latency_us: float
    noc_latency_us: float
    die_to_die_latency_us: float
    switch_latency_us: float
    cable_latency_us: float
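To illustrate how a bandwidth/latency pair from `TopologySpec` might be consumed, here is a rough sketch of a single point-to-point transfer estimate. It assumes a simple latency-plus-serialization model, which the source does not specify, and it assumes the `*_bandwidth_gbps` fields are in GB/s as their inline comments indicate. The function name `transfer_time_us` is hypothetical.

```python
def transfer_time_us(data_bytes: float, bandwidth_gbps: float, latency_us: float) -> float:
    """Estimated one-way transfer time in microseconds.

    Assumed model: fixed link latency plus data size divided by bandwidth.
    bandwidth_gbps is interpreted as GB/s (1e9 bytes per second).
    """
    seconds_on_wire = data_bytes / (bandwidth_gbps * 1e9)
    return latency_us + seconds_on_wire * 1e6


# 400 MB over a c2c link (400 GB/s, 0.15 us) versus a p2p link (400 GB/s, 5.0 us)
c2c_us = transfer_time_us(400e6, bandwidth_gbps=400, latency_us=0.15)
p2p_us = transfer_time_us(400e6, bandwidth_gbps=400, latency_us=5.0)

print(f"c2c: {c2c_us:.2f} us")
print(f"p2p: {p2p_us:.2f} us")
```

For large payloads the bandwidth term dominates, so the four levels differ mainly for small messages, where the per-level latency is the larger share of the total.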

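Topology configuration format (YAML)

The topology YAML stores only the hierarchy and the interconnect parameters. It does not embed chip specifications, which are managed separately under the chips/ directory and injected dynamically at runtime: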

name: P1-R1-B1-C8

pods:
  - count: 1
    racks:
      - count: 1
        boards:
          - count: 1
            chips:
              - name: SG2262 # references chips/SG2262.yaml
                preset_id: sg2262 # optional, used for config loading
                count: 8

interconnect:
  links:
    c2c: { bandwidth_gbps: 400, latency_us: 0.15 }
    b2b: { bandwidth_gbps: 400, latency_us: 2.0 }
    r2r: { bandwidth_gbps: 400, latency_us: 3.0 }
    p2p: { bandwidth_gbps: 400, latency_us: 5.0 }
  comm_params:
    allreduce_algorithm: "ring"
    alltoall_algorithm: "pairwise"
    enable_compute_comm_overlap: true
    network_efficiency: 0.85
    bandwidth_utilization: 0.95
    sync_latency_us: 0
    switch_latency_us: 1.0
    cable_latency_us: 0.025
    memory_read_latency_us: 0.15
    memory_write_latency_us: 0.01
    noc_latency_us: 0.05
    die_to_die_latency_us: 0.04
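The nested `count` fields suggest that the total chip count is the product of the counts along each branch of the hierarchy, which matches the name P1-R1-B1-C8 (1 pod, 1 rack, 1 board, 8 chips). The sketch below assumes that interpretation and assumes the YAML has already been parsed into nested dicts and lists; `total_chips` is a hypothetical helper.

```python
from typing import Any

# The pods section of the example topology, as it might look after YAML parsing
topology: dict[str, Any] = {
    "name": "P1-R1-B1-C8",
    "pods": [
        {
            "count": 1,
            "racks": [
                {
                    "count": 1,
                    "boards": [
                        {"count": 1, "chips": [{"name": "SG2262", "count": 8}]}
                    ],
                }
            ],
        }
    ],
}


def total_chips(topo: dict[str, Any]) -> int:
    """Multiply counts down each pod -> rack -> board -> chip branch and sum."""
    total = 0
    for pod in topo["pods"]:
        for rack in pod["racks"]:
            for board in rack["boards"]:
                chips_per_board = sum(chip["count"] for chip in board["chips"])
                total += pod["count"] * rack["count"] * board["count"] * chips_per_board
    return total


chip_total = total_chips(topology)
print(chip_total)  # 8
```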

TopologySpecImpl

Path resolution

@dataclass
class TopologySpecImpl:
    pods: dict[str, list[str]]    # pod_id -> rack_ids
    racks: dict[str, list[str]]   # rack_id -> board_ids
    boards: dict[str, list[str]]  # board_id -> chip_ids
    chips: list[str]              # all chip_ids
    link_profiles: dict[str, LinkProfileImpl]  # link parameters

    def resolve_path(self, src_chip, dst_chip) -> tuple[str, int]:
        """Resolve the path key and hop count between two chips
        - same board: ("c2c", 1)
        - same rack, different board: ("b2b", 2)
        - same pod, different rack: ("r2r", 3)
        - across pods: ("p2p", 4)
        """
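The four cases in the docstring amount to finding the lowest hierarchy level that the two chips share. Below is one possible sketch written as a free function over the same three mappings. It is an illustration rather than the actual `TopologySpecImpl` method, and the IDs in the example are made up. The source does not say what happens when `src_chip == dst_chip`; this sketch treats it as the same-board case.

```python
def resolve_path(
    pods: dict[str, list[str]],
    racks: dict[str, list[str]],
    boards: dict[str, list[str]],
    src_chip: str,
    dst_chip: str,
) -> tuple[str, int]:
    """Return (path key, hop count) for the lowest level shared by both chips."""
    # Invert the parent -> children maps into child -> parent lookups
    chip_to_board = {chip: b for b, chips in boards.items() for chip in chips}
    board_to_rack = {board: r for r, bs in racks.items() for board in bs}
    rack_to_pod = {rack: p for p, rs in pods.items() for rack in rs}

    src_board, dst_board = chip_to_board[src_chip], chip_to_board[dst_chip]
    if src_board == dst_board:
        return ("c2c", 1)
    src_rack, dst_rack = board_to_rack[src_board], board_to_rack[dst_board]
    if src_rack == dst_rack:
        return ("b2b", 2)
    if rack_to_pod[src_rack] == rack_to_pod[dst_rack]:
        return ("r2r", 3)
    return ("p2p", 4)


# Made-up topology: 2 pods, 3 racks, 4 boards, 5 chips
pods = {"pod0": ["rack0", "rack1"], "pod1": ["rack2"]}
racks = {"rack0": ["b0", "b1"], "rack1": ["b2"], "rack2": ["b3"]}
boards = {"b0": ["c0", "c1"], "b1": ["c2"], "b2": ["c3"], "b3": ["c4"]}

print(resolve_path(pods, racks, boards, "c0", "c1"))  # ('c2c', 1)
print(resolve_path(pods, racks, boards, "c0", "c2"))  # ('b2b', 2)
print(resolve_path(pods, racks, boards, "c0", "c3"))  # ('r2r', 3)
print(resolve_path(pods, racks, boards, "c0", "c4"))  # ('p2p', 4)
```

The returned key can then be used to look up the matching entry in `link_profiles`. An unknown chip ID raises `KeyError` in this sketch.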

Board / Rack / Pod

BoardSpecImpl

@dataclass
class BoardSpecImpl:
    board_id: str
    chips: list[ChipSpecImpl]           # all chips on the board
    chip_interconnect: LinkProfileImpl  # chip-to-chip interconnect (c2c)

    def get_total_compute(self) -> float:
        """Total compute of the board"""

    def get_allreduce_time(self, data_bytes, algorithm) -> float:
        """Intra-board AllReduce time"""
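The source does not give the formula behind `get_allreduce_time`. As an illustration only, the sketch below uses the textbook ring AllReduce cost model: 2(N-1) steps, each moving 1/N of the data over one link. The function name `ring_allreduce_time_us` is hypothetical, bandwidth is assumed to be in GB/s, and this is not necessarily what `BoardSpecImpl` computes.

```python
def ring_allreduce_time_us(
    data_bytes: float,
    num_chips: int,
    bandwidth_gbps: float,
    latency_us: float,
) -> float:
    """Textbook ring AllReduce estimate in microseconds (assumed model).

    Reduce-scatter and all-gather each take (N - 1) steps, and every step
    sends a chunk of data_bytes / N over a single link.
    """
    if num_chips < 2:
        return 0.0  # nothing to exchange with a single chip
    steps = 2 * (num_chips - 1)
    chunk_bytes = data_bytes / num_chips
    per_step_us = latency_us + chunk_bytes / (bandwidth_gbps * 1e9) * 1e6
    return steps * per_step_us


# 8 chips on one board, 800 MB of data, c2c link at 400 GB/s and 0.15 us
board_allreduce_us = ring_allreduce_time_us(800e6, 8, 400, 0.15)
print(f"{board_allreduce_us:.1f} us")
```

With these inputs each of the 14 steps moves 100 MB, so the bandwidth term (about 250 us per step) far outweighs the 0.15 us link latency.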

RackSpecImpl

@dataclass
class RackSpecImpl:
    rack_id: str
    boards: list[BoardSpecImpl]
    b2b_bandwidth_gbps: float
    b2b_latency_us: float

PodSpecImpl

@dataclass
class PodSpecImpl:
    pod_id: str
    racks: list[RackSpecImpl]
    r2r_bandwidth_gbps: float
    r2r_latency_us: float

Merging hardware parameters

The L4 evaluation layer needs a flat dict[str, float] of hardware parameters. merge_specs() merges three specs into one:

def merge_specs(
    hardware: HardwareSpec,           # chip level (compute_tflops, memory_bw, ...)
    topology: TopologySpec,           # communication parameters (c2c/b2b/r2r/p2p bandwidth + latency)
    comm_protocol: CommProtocolSpec,  # protocol parameters (bw_utilization, sync_lat)
) -> dict[str, float]:
    return {**hardware.to_dict(), **topology.to_dict(), **comm_protocol.to_dict()}

The output dict contains about 25 keys and is consumed directly by the L4 cost model and evaluator. A missing key raises KeyError; no default values are used.
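The merge and the strict-lookup behavior can be illustrated with cut-down stand-ins for the three specs. This is a sketch under assumptions: the classes below carry only two fields each, and field names such as `memory_bw_gbps` and `sync_lat_us` are hypothetical. The real specs yield about 25 keys in total.

```python
from dataclasses import asdict, dataclass


@dataclass
class HardwareSpec:  # cut-down stand-in
    compute_tflops: float
    memory_bw_gbps: float

    def to_dict(self) -> dict[str, float]:
        return asdict(self)


@dataclass
class TopologySpec:  # cut-down stand-in
    c2c_bandwidth_gbps: float
    c2c_latency_us: float

    def to_dict(self) -> dict[str, float]:
        return asdict(self)


@dataclass
class CommProtocolSpec:  # cut-down stand-in
    bw_utilization: float
    sync_lat_us: float

    def to_dict(self) -> dict[str, float]:
        return asdict(self)


def merge_specs(hardware, topology, comm_protocol) -> dict[str, float]:
    # On a key collision, the right-most dict wins
    return {**hardware.to_dict(), **topology.to_dict(), **comm_protocol.to_dict()}


params = merge_specs(
    HardwareSpec(compute_tflops=262.144, memory_bw_gbps=8601.6),
    TopologySpec(c2c_bandwidth_gbps=400.0, c2c_latency_us=0.15),
    CommProtocolSpec(bw_utilization=0.95, sync_lat_us=0.0),
)

# Consumers index the dict directly, so a missing key fails loudly
effective_c2c = params["c2c_bandwidth_gbps"] * params["bw_utilization"]
try:
    params["b2b_bandwidth_gbps"]  # not provided by the cut-down TopologySpec
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(sorted(params))
print(lookup_failed)  # True
```

Because dict unpacking lets later entries override earlier ones, key names across the three specs need to be unique, otherwise one spec silently shadows another.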