L2 Arch -- Hardware Architecture Layer
Overview
L2 defines a five-level hardware hierarchy, from Pod (cluster) down to Core (compute core):
Pod (cluster)
+-- Rack (cabinet)
    +-- Board (node/server)
        +-- Chip (accelerator)
            +-- Core (compute core)
                +-- Compute Units (Cube, Vector)
Out of scope: performance evaluation and parallel partitioning.
Module list
| Module | Responsibility |
|---|---|
| chip.py | ChipSpecImpl (peak compute, memory, microarchitecture) |
| core.py | CoreSpecImpl (compute core specification) |
| compute.py | ComputeSpec (Cube/Vector units) |
| memory.py | MemoryHierarchyImpl (GMEM, LMEM) |
| interconnect.py | InterconnectSpecImpl (NoC, c2c) |
| dma.py | DMAEngineImpl (GDMA, SDMA) |
| board.py | BoardSpecImpl (N chips per board) |
| rack.py | RackSpecImpl (M boards per rack) |
| pod.py | PodSpecImpl (K racks per pod) |
| topology.py | TopologySpec (4-level communication parameters) + TopologySpecImpl |
| protocols.py | Hardware abstraction protocols |
ChipSpecImpl
Core capabilities
class ChipSpecImpl:
    name: str                           # "SG2262"
    core_count: int                     # 64
    frequency_ghz: float                # 1.5
    tiu_frequency_ghz: float            # 1.0 (the TIU has its own clock)
    interconnect: InterconnectSpecImpl  # NoC specification

    def get_peak_flops(self, dtype: str, unit: str) -> float:
        """Peak compute (FLOPS).
        dtype: "BF16", "FP8", "INT8", ...
        unit: "cube" | "vector"
        """

    def get_gmem_bandwidth(self) -> float:
        """Effective global memory bandwidth (GB/s) = bandwidth_gbps * utilization"""

    def get_total_sram(self) -> int:
        """Total SRAM capacity (bytes) = core_count * lmem_capacity_kb * 1024"""

    @classmethod
    def from_config(cls, name, config: dict) -> "ChipSpecImpl":
        """Build from a YAML config; every field is required, and a missing field raises ValueError"""
Chip config format (YAML)
Using SG2262 as an example (see 07-configs.md for the full format):
name: SG2262
architecture: TPU_V7
process: 7nm
frequency_ghz: 1.5
tiu_frequency_ghz: 1.0
align_bytes: 32
compute_dma_overlap_rate: 0.8   # compute/DMA overlap ratio
compute_efficiency: 0.9         # compute efficiency
cores:
  count: 64                     # number of cores
  lanes_per_core: 16            # SIMD lanes per core
compute_units:
  cube:                         # GEMM matrix unit
    m: 16
    k: 32
    n: 8
    mac_per_lane: { FP8: 256, BF16: 128, INT8: 256, TF32: 64, INT4: 512 }
  vector:                       # vector unit
    eu_per_lane: { BF16: 32, FP16: 32, FP32: 16, INT8: 64 }
memory:
  gmem:                         # global memory (off-chip DRAM)
    type: LPDDR5
    capacity_gb: 128
    bandwidth_gbps: 8601.6
    utilization: 1.0            # effective bandwidth = bandwidth_gbps * utilization
    latency_ns: 100
  lmem:                         # per-core local SRAM (capacity is per core)
    capacity_kb: 2048
    bandwidth_gbps: 2000
    latency_ns: 1
    utilization: 0.45           # fraction of SRAM available for tiling
dma_engines:
  gdma:                         # gmem <-> lmem transfers
    bandwidth_gbps: 68
    startup_latency_ns: 100
    efficiency: 0.9
  sdma:                         # shared DMA (chip-to-chip transfers)
    bandwidth_gbps: 64
    startup_latency_ns: 120
    efficiency: 0.85
hau:                            # hardware acceleration unit (TopK/sorting)
  sort_width: 16
  compare_cycles: 1
  init_cycles: 20
noc:                            # on-chip network (core-to-core communication)
  topology: Mesh
  mesh_cols: 8
  mesh_rows: 8
  bandwidth_gbps: 1000
  latency_ns: 10
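As a rough illustration of how a config of this shape could feed `from_config` and the two derived getters, the sketch below uses a hypothetical `ChipSketch` class and `_require` helper (neither appears in the source). It assumes the nesting implied by the comments above (for example `memory.gmem.bandwidth_gbps`) and reads only a handful of fields; the real `ChipSpecImpl` may validate and store far more.

```python
from dataclasses import dataclass


def _require(config: dict, *path: str):
    """Walk nested keys; raise ValueError naming the first missing field."""
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            raise ValueError(f"missing required field: {'.'.join(path)}")
        node = node[key]
    return node


@dataclass
class ChipSketch:  # hypothetical stand-in for ChipSpecImpl
    name: str
    core_count: int
    gmem_bandwidth_gbps: float
    gmem_utilization: float
    lmem_capacity_kb: int

    @classmethod
    def from_config(cls, name: str, config: dict) -> "ChipSketch":
        # No defaults: every field read here must be present.
        return cls(
            name=name,
            core_count=_require(config, "cores", "count"),
            gmem_bandwidth_gbps=_require(config, "memory", "gmem", "bandwidth_gbps"),
            gmem_utilization=_require(config, "memory", "gmem", "utilization"),
            lmem_capacity_kb=_require(config, "memory", "lmem", "capacity_kb"),
        )

    def get_gmem_bandwidth(self) -> float:
        # Effective bandwidth = bandwidth_gbps * utilization
        return self.gmem_bandwidth_gbps * self.gmem_utilization

    def get_total_sram(self) -> int:
        # Total SRAM (bytes) = core_count * lmem_capacity_kb * 1024
        return self.core_count * self.lmem_capacity_kb * 1024


config = {
    "cores": {"count": 64},
    "memory": {
        "gmem": {"bandwidth_gbps": 8601.6, "utilization": 1.0},
        "lmem": {"capacity_kb": 2048},
    },
}
chip = ChipSketch.from_config("SG2262", config)
print(chip.get_total_sram())      # 134217728 bytes (128 MiB)
print(chip.get_gmem_bandwidth())  # 8601.6
```

Raising on the first missing path keeps the failure close to the config error instead of surfacing later as a wrong estimate.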
Peak compute calculation
peak_flops = cores * lanes_per_core * mac_per_lane[dtype] * tiu_frequency_ghz * 1e9 * 2
Example (SG2262, BF16, Cube):
= 64 * 16 * 128 * 1.0 * 1e9 * 2
= 2.62144e14 FLOPS (~262 TFLOPS)
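The formula can be checked with a few lines of Python. This is a minimal sketch: `peak_flops` is a hypothetical free function rather than the project's `get_peak_flops`, and the `MAC_PER_LANE` table is copied from the SG2262 Cube config.

```python
def peak_flops(cores: int, lanes_per_core: int, mac_per_lane: int,
               tiu_frequency_ghz: float) -> float:
    """Peak FLOPS; the trailing factor 2 counts each MAC as multiply + add."""
    return cores * lanes_per_core * mac_per_lane * tiu_frequency_ghz * 1e9 * 2


# Cube mac_per_lane values from the SG2262 config.
MAC_PER_LANE = {"FP8": 256, "BF16": 128, "INT8": 256, "TF32": 64, "INT4": 512}

bf16 = peak_flops(64, 16, MAC_PER_LANE["BF16"], 1.0)
print(f"{bf16 / 1e12:.3f} TFLOPS")  # 262.144 TFLOPS
```

Note that the TIU clock (1.0 GHz), not the chip clock (1.5 GHz), enters the formula.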
TopologySpec
4-level communication parameters
@dataclass
class TopologySpec:
    # Chip-to-Chip (within a board)
    c2c_bandwidth_gbps: float  # e.g. 400 GB/s
    c2c_latency_us: float      # e.g. 0.15 us
    # Board-to-Board (within a rack)
    b2b_bandwidth_gbps: float  # e.g. 400 GB/s
    b2b_latency_us: float      # e.g. 2.0 us
    # Rack-to-Rack (within a pod)
    r2r_bandwidth_gbps: float  # e.g. 400 GB/s (InfiniBand)
    r2r_latency_us: float      # e.g. 3.0 us
    # Pod-to-Pod (across pods)
    p2p_bandwidth_gbps: float  # e.g. 400 GB/s (Ethernet)
    p2p_latency_us: float      # e.g. 5.0 us
    # Additional latency parameters (injected from comm_params in the topology YAML)
    memory_read_latency_us: float
    memory_write_latency_us: float
    noc_latency_us: float
    die_to_die_latency_us: float
    switch_latency_us: float
    cable_latency_us: float
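Because the four link levels share a naming pattern, a level key such as "c2c" can be mapped to its bandwidth and latency fields generically. The sketch below is an assumption about how a caller might do that; `TopologySketch` is a hypothetical subset of `TopologySpec` and `link_params` is not part of the source.

```python
from dataclasses import dataclass

LEVELS = ("c2c", "b2b", "r2r", "p2p")


@dataclass
class TopologySketch:  # hypothetical subset of TopologySpec
    c2c_bandwidth_gbps: float
    c2c_latency_us: float
    b2b_bandwidth_gbps: float
    b2b_latency_us: float
    r2r_bandwidth_gbps: float
    r2r_latency_us: float
    p2p_bandwidth_gbps: float
    p2p_latency_us: float


def link_params(topo: TopologySketch, level: str) -> tuple[float, float]:
    """Return (bandwidth, latency_us) for one of the four link levels."""
    if level not in LEVELS:
        raise KeyError(f"unknown link level: {level}")
    return (getattr(topo, f"{level}_bandwidth_gbps"),
            getattr(topo, f"{level}_latency_us"))


topo = TopologySketch(400.0, 0.15, 400.0, 2.0, 400.0, 3.0, 400.0, 5.0)
print(link_params(topo, "b2b"))  # (400.0, 2.0)
```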
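Topology config format (YAML)
The topology YAML stores only the hierarchy and the interconnect parameters. It does not embed chip specs: those are managed separately under the chips/ directory and injected at runtime.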
name: P1-R1-B1-C8
pods:
  - count: 1
    racks:
      - count: 1
        boards:
          - count: 1
            chips:
              - name: SG2262        # references chips/SG2262.yaml
                preset_id: sg2262   # optional, used for config loading
                count: 8
interconnect:
  links:
    c2c: { bandwidth_gbps: 400, latency_us: 0.15 }
    b2b: { bandwidth_gbps: 400, latency_us: 2.0 }
    r2r: { bandwidth_gbps: 400, latency_us: 3.0 }
    p2p: { bandwidth_gbps: 400, latency_us: 5.0 }
  comm_params:
    allreduce_algorithm: "ring"
    alltoall_algorithm: "pairwise"
    enable_compute_comm_overlap: true
    network_efficiency: 0.85
    bandwidth_utilization: 0.95
    sync_latency_us: 0
    switch_latency_us: 1.0
    cable_latency_us: 0.025
    memory_read_latency_us: 0.15
    memory_write_latency_us: 0.01
    noc_latency_us: 0.05
    die_to_die_latency_us: 0.04
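To make the `count` fields concrete, the following sketch expands the nested pods/racks/boards/chips structure into flat ID maps of the kind `TopologySpecImpl` holds. The function name `expand_topology` and the `p0.r0.b0.c0` ID scheme are assumptions for illustration; the source does not say how IDs are generated.

```python
def expand_topology(cfg: dict):
    """Expand nested `count` fields into pod/rack/board maps and a chip list."""
    pods: dict[str, list[str]] = {}
    racks: dict[str, list[str]] = {}
    boards: dict[str, list[str]] = {}
    chips: list[str] = []
    pod_idx = 0
    for pod_group in cfg["pods"]:
        for _ in range(pod_group["count"]):
            pod_id = f"p{pod_idx}"
            pod_idx += 1
            pods[pod_id] = []
            rack_idx = 0
            for rack_group in pod_group["racks"]:
                for _ in range(rack_group["count"]):
                    rack_id = f"{pod_id}.r{rack_idx}"
                    rack_idx += 1
                    pods[pod_id].append(rack_id)
                    racks[rack_id] = []
                    board_idx = 0
                    for board_group in rack_group["boards"]:
                        for _ in range(board_group["count"]):
                            board_id = f"{rack_id}.b{board_idx}"
                            board_idx += 1
                            racks[rack_id].append(board_id)
                            boards[board_id] = []
                            chip_idx = 0
                            for chip_group in board_group["chips"]:
                                for _ in range(chip_group["count"]):
                                    chip_id = f"{board_id}.c{chip_idx}"
                                    chip_idx += 1
                                    boards[board_id].append(chip_id)
                                    chips.append(chip_id)
    return pods, racks, boards, chips


# The P1-R1-B1-C8 example, written as the dict a YAML loader would produce.
cfg = {"pods": [{"count": 1, "racks": [{"count": 1, "boards": [
    {"count": 1, "chips": [{"name": "SG2262", "count": 8}]}]}]}]}
pods, racks, boards, chips = expand_topology(cfg)
print(len(chips))  # 8
```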
TopologySpecImpl
Path resolution
@dataclass
class TopologySpecImpl:
    pods: dict[str, list[str]]                 # pod_id -> rack_ids
    racks: dict[str, list[str]]                # rack_id -> board_ids
    boards: dict[str, list[str]]               # board_id -> chip_ids
    chips: list[str]                           # all chip_ids
    link_profiles: dict[str, LinkProfileImpl]  # link parameters

    def resolve_path(self, src_chip, dst_chip) -> tuple[str, int]:
        """Resolve the path key and hop count between two chips.
        - same board: ("c2c", 1)
        - same rack, different board: ("b2b", 2)
        - same pod, different rack: ("r2r", 3)
        - across pods: ("p2p", 4)
        """
Board / Rack / Pod
BoardSpecImpl
@dataclass
class BoardSpecImpl:
    board_id: str
    chips: list[ChipSpecImpl]           # all chips on the board
    chip_interconnect: LinkProfileImpl  # chip-to-chip interconnect (c2c)

    def get_total_compute(self) -> float:
        """Total compute of the board"""

    def get_allreduce_time(self, data_bytes, algorithm) -> float:
        """Intra-board AllReduce time"""
RackSpecImpl
@dataclass
class RackSpecImpl:
    rack_id: str
    boards: list[BoardSpecImpl]
    b2b_bandwidth_gbps: float
    b2b_latency_us: float
PodSpecImpl
@dataclass
class PodSpecImpl:
    pod_id: str
    racks: list[RackSpecImpl]
    r2r_bandwidth_gbps: float
    r2r_latency_us: float
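The source does not give the formula behind `BoardSpecImpl.get_allreduce_time`. As a hedged illustration, the sketch below applies the textbook ring AllReduce cost model (2(n-1) steps, each moving 1/n of the data) to a link's bandwidth and latency. The function name is hypothetical, bandwidth is assumed to be in GB/s as the field comments state, and efficiency factors such as `network_efficiency` are ignored.

```python
def ring_allreduce_time_s(data_bytes: float, num_chips: int,
                          bandwidth_gbps: float, latency_us: float) -> float:
    """Textbook ring AllReduce estimate in seconds (not the project's model)."""
    if num_chips <= 1:
        return 0.0  # nothing to reduce across
    steps = 2 * (num_chips - 1)          # reduce-scatter + all-gather
    chunk_bytes = data_bytes / num_chips  # each step moves one chunk
    per_step_s = latency_us * 1e-6 + chunk_bytes / (bandwidth_gbps * 1e9)
    return steps * per_step_s


# 8 chips on one board over c2c links (400 GB/s, 0.15 us), reducing 1e9 bytes.
t = ring_allreduce_time_s(1e9, 8, 400.0, 0.15)
print(f"{t * 1e3:.3f} ms")
```

The same function applies at rack or pod level by passing the b2b or r2r parameters instead.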
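Hardware parameter merging
The L4 evaluation layer needs a flat dict[str, float] as its hardware parameters.
merge_specs() merges three specs into one:
def merge_specs(
    hardware: HardwareSpec,           # chip level (compute_tflops, memory_bw, ...)
    topology: TopologySpec,           # communication parameters (c2c/b2b/r2r/p2p bandwidth + latency)
    comm_protocol: CommProtocolSpec,  # protocol parameters (bw_utilization, sync_lat)
) -> dict[str, float]:
    # If keys collide, later specs override earlier ones.
    return {**hardware.to_dict(), **topology.to_dict(), **comm_protocol.to_dict()}
The output dict contains about 25 keys and is consumed directly by the L4 cost model and evaluator. A missing key raises KeyError; no default values are used.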
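A minimal, self-contained sketch of the merge and the strict lookup follows. The three `*Sketch` dataclasses and their field names are hypothetical stand-ins for `HardwareSpec`, `TopologySpec`, and `CommProtocolSpec`, and `to_dict` is assumed to return the fields as a flat dict.

```python
from dataclasses import dataclass, asdict


@dataclass
class HardwareSketch:  # hypothetical stand-in for HardwareSpec
    compute_tflops: float
    memory_bw_gbps: float

    def to_dict(self) -> dict:
        return asdict(self)


@dataclass
class TopologySketch:  # hypothetical stand-in for TopologySpec
    c2c_bandwidth_gbps: float
    c2c_latency_us: float

    def to_dict(self) -> dict:
        return asdict(self)


@dataclass
class CommSketch:  # hypothetical stand-in for CommProtocolSpec
    bandwidth_utilization: float
    sync_latency_us: float

    def to_dict(self) -> dict:
        return asdict(self)


def merge_specs(hardware, topology, comm_protocol) -> dict[str, float]:
    # Dict unpacking keeps the last value when keys collide.
    return {**hardware.to_dict(), **topology.to_dict(), **comm_protocol.to_dict()}


params = merge_specs(
    HardwareSketch(262.144, 8601.6),
    TopologySketch(400.0, 0.15),
    CommSketch(0.95, 0.0),
)
print(params["c2c_bandwidth_gbps"])  # 400.0

# Strict access: plain indexing raises KeyError instead of using a default.
try:
    params["r2r_latency_us"]
except KeyError as exc:
    print(f"missing hardware parameter: {exc}")
```

Using `params[key]` instead of `params.get(key, default)` is what makes a missing parameter fail loudly.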