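Topology Routing Module Design

Date: 2026-03-10 Status: Final design

Core Problem

Given a chip-to-chip communication request, compute the end-to-end latency and bottleneck bandwidth of the path.

Design Principles

Invariants first: Graph (nodes + edges) and RoutePath (latency + bandwidth) are stable interfaces that do not change across implementation phases. What varies is the routing algorithm implementation, not the interface.
Conservation law: every "simplification" moves complexity somewhere else, and each decision must state explicitly where that complexity went.
Explicit boundary conditions: an illegal configuration must raise an error; the module never guesses implicitly or tolerates faults silently.
Symmetry: identical structures are handled identically (chips and switches are both graph nodes; all links share the link_types specification).
Explicit over implicit: a configuration file should be self-describing and must not depend on hidden rules or mode names.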

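Data Model (Invariants)

Backend (Python)

from dataclasses import dataclass

@dataclass
class NodeSpec:
    id: str                            # globally unique short name
    type: str                          # "chip" | "switch"
    switch_tier: str | None = None     # switches only: "tor" | "spine" | "leaf"
    port_count: int | None = None      # switches only: maximum number of ports
    forwarding_latency_us: float = 0.0 # switch forwarding latency in µs (the YAML field is forwarding_latency_ns)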

@dataclass
class EdgeSpec:
    id: str
    src: str
    dst: str
    bandwidth_gbps: float            # per-direction bandwidth (full duplex; the two directions are independent)
    latency_us: float
    link_type: str                   # key into link_types

@dataclass
class NetworkGraph:
    nodes: dict[str, NodeSpec]
    edges: dict[str, EdgeSpec]
    adjacency: dict[str, list[tuple[str, EdgeSpec]]]

@dataclass
class RoutePath:
    nodes: list[str]                 # hop-by-hop nodes (including intermediate switches)
    edges: list[EdgeSpec]            # hop-by-hop edges
    hop_latencies_us: list[float]    # latency of each hop
    hop_bandwidths_gbps: list[float] # bandwidth of each hop
    total_latency_us: float
    bottleneck_bw_gbps: float
    single_flow_model: bool = True

Key interfaces:

def build_network_graph(config: dict) -> NetworkGraph
def validate_network_graph(graph: NetworkGraph) -> list[str]
def dijkstra_route(graph: NetworkGraph, src: str, dst: str) -> RoutePath | None

Frontend (TypeScript)

interface NetworkGraphNode {
  id: string
  label: string
  type: 'chip' | 'switch'
  switchTier?: 'tor' | 'spine' | 'leaf'
  portCount?: number
}

interface NetworkGraphEdge {
  id: string
  source: string
  target: string
  bandwidthGbps: number
  latencyUs: number
  linkType: string
}

interface RoutePath {
  nodes: string[]
  edges: NetworkGraphEdge[]
  hopLatenciesUs: number[]
  hopBandwidthsGbps: number[]
  totalLatencyUs: number
  bottleneckBwGbps: number
  singleFlowModel: boolean
}

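YAML Configuration Specification

The configuration describes three things: which nodes exist, how they are connected, and how traffic is routed.

Separation of responsibilities between pods and network:

pods describes the physical layout and the constraint space: how many chips there are and which board/rack/pod each one sits on. Physical position determines which interconnect types are available (chips on the same board can use c2c; chips in different racks can only use r2r). It serves both the 3D visualization and the parallel-strategy mapping.
network describes the logical connections and routing: how the chips are actually wired together. The same set of chips can be fully connected, arranged in a ring, or arranged in a star around a switch; this is a design choice and is not dictated by the physical layout.

This separation supports topology exploration: keep pods fixed (the physical resources do not change), modify network.links (try a different wiring scheme), and compare the resulting performance.

Chip Short Names

Chips are numbered automatically as c0, c1, ..., cN in pods expansion order: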

pods:
  - count: 1
    racks:
      - count: 2
        boards:
          - count: 1
            chips:
              - name: SG2262
                count: 4
                preset_id: sg2262
# Expansion result:
# rack 0, board 0: c0 c1 c2 c3
# rack 1, board 0: c4 c5 c6 c7

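Advantages of short names:

Links are concise to write (c0 vs pod_0/rack_0/board_0/chip_0)
Network topology is decoupled from physical position: changing the wiring only touches links, not chip names
The system maintains a mapping table from short ID to physical position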
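
The expansion rule and the mapping table can be illustrated with a minimal sketch. The function name expand_chip_ids, the helper _expand, and the choice to restart rack/board/chip indices inside each parent are assumptions made for illustration; only the c{N} numbering and the pod_/rack_/board_/chip_ path format come from this document.

```python
from typing import Iterator


def _expand(entries: list[dict]) -> Iterator[tuple[int, dict]]:
    """Yield (index, entry) pairs, repeating each entry `count` times."""
    index = 0
    for entry in entries:
        for _ in range(entry.get("count", 1)):
            yield index, entry
            index += 1


def expand_chip_ids(pods: list[dict]) -> dict[str, str]:
    """Map short chip IDs (c0, c1, ...) to physical paths in expansion order."""
    mapping: dict[str, str] = {}
    for p, pod in _expand(pods):
        for r, rack in _expand(pod.get("racks", [])):
            for b, board in _expand(rack.get("boards", [])):
                for c, _chip in _expand(board.get("chips", [])):
                    # The short ID is simply the running chip counter.
                    short_id = f"c{len(mapping)}"
                    mapping[short_id] = f"pod_{p}/rack_{r}/board_{b}/chip_{c}"
    if not mapping:
        # Mirrors the "0 chips" boundary condition.
        raise ValueError("configuration contains no chips")
    return mapping
```

Applied to the pods block above (1 pod, 2 racks, 1 board per rack, 4 chips per board), this sketch assigns c0..c3 to rack 0 and c4..c7 to rack 1.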

Complete Configuration Example

name: "clos-2rack-8chip"

pods:
  - count: 1
    racks:
      - count: 2
        boards:
          - count: 1
            chips:
              - name: SG2262
                count: 4
                preset_id: sg2262

network:
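  # bandwidth_gbps: per-direction bandwidth (full duplex; the two directions are independent)
  # Contention is not modeled in the routing layer: it only outputs a static bottleneck_bw, and dynamic contention is handled by the upper-layer evaluator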
  link_types:
    c2c: { bandwidth_gbps: 448, latency_us: 0.2 }
    b2b: { bandwidth_gbps: 400, latency_us: 1.0 }
    r2r: { bandwidth_gbps: 400, latency_us: 3.0 }

  switches:
    - { id: s0, port_count: 32, forwarding_latency_ns: 150 }
    - { id: s1, port_count: 32, forwarding_latency_ns: 150 }
    - { id: s2, port_count: 64, forwarding_latency_ns: 300 }

  links:
    # Full mesh within a board (Cartesian product; self-loops are excluded automatically)
    - { from: [c0, c1, c2, c3], to: [c0, c1, c2, c3], type: c2c }
    - { from: [c4, c5, c6, c7], to: [c4, c5, c6, c7], type: c2c }

    # Chip → ToR switch
    - { from: [c0, c1, c2, c3], to: s0, type: b2b }
    - { from: [c4, c5, c6, c7], to: s1, type: b2b }

    # ToR → spine switch
    - { from: [s0, s1], to: s2, type: r2r }

  routing:
    algorithm: shortest_path
    weight: latency_us
    bandwidth_utilization: 0.90

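links Syntax

Each link entry is a from/to Cartesian product: every node in from is connected to every node in to.

# Single edge
- { from: c0, to: c1, type: c2c }

# One-to-many
- { from: c0, to: [c1, c2, c3], type: c2c }

# Many-to-one (star)
- { from: [c0, c1, c2, c3], to: s0, type: b2b }

# Many-to-many (bipartite)
- { from: [s0, s1], to: [s2, s3], type: r2r }

from/to accept the following values:

A single node ID: c0, s0
A list of node IDs: [c0, c1, c2, c3]

Both are native YAML types (string or list), so there is no custom syntax and no parsing ambiguity.

Every link must specify type, which references a key in link_types to obtain bandwidth and latency. Link parameters are managed centrally in link_types; links only describes connectivity.

All edges are undirected (they represent bidirectional connectivity): from: c0, to: c1 is equivalent to from: c1, to: c0. bandwidth_gbps is the per-direction bandwidth; on a full-duplex link the two directions are independent and do not interfere with each other.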
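
The Cartesian-product rule, self-loop exclusion, and undirected equivalence can be sketched as follows. The function name expand_links is hypothetical, and two behaviors here are assumptions rather than stated rules: an unordered pair that appears more than once (as happens in a full-mesh entry) is collapsed to a single edge, and the first type seen for a pair wins.

```python
def expand_links(links: list[dict]) -> list[tuple[str, str, str]]:
    """Expand from/to entries into unique undirected (a, b, type) edges."""
    seen: set[tuple[str, str]] = set()
    edges: list[tuple[str, str, str]] = []
    for link in links:
        if "type" not in link:
            raise ValueError(f"link is missing required 'type': {link}")
        # A bare string is treated as a one-element list.
        sources = link["from"] if isinstance(link["from"], list) else [link["from"]]
        targets = link["to"] if isinstance(link["to"], list) else [link["to"]]
        for src in sources:
            for dst in targets:
                if src == dst:
                    continue  # self-loops are excluded automatically
                # Canonical ordering makes (c0, c1) and (c1, c0) the same key.
                key = (min(src, dst), max(src, dst))
                if key in seen:
                    continue
                seen.add(key)
                edges.append((key[0], key[1], link["type"]))
    return edges
```

With this sketch, a full-mesh entry over four chips produces six edges (one per unordered pair) instead of the twelve ordered pairs of the raw product.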

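Topology Exploration Example

For the same set of chips, changing only links is enough to explore different topologies:

# Option A: full mesh (no switch)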
links:
  - { from: [c0, c1, c2, c3, c4, c5, c6, c7], to: [c0, c1, c2, c3, c4, c5, c6, c7], type: c2c }

# Option B: ring
links:
  - { from: c0, to: c1, type: c2c }
  - { from: c1, to: c2, type: c2c }
  - { from: c2, to: c3, type: c2c }
  - { from: c3, to: c4, type: c2c }
  - { from: c4, to: c5, type: c2c }
  - { from: c5, to: c6, type: c2c }
  - { from: c6, to: c7, type: c2c }
  - { from: c7, to: c0, type: c2c }

# Option C: star around a single switch
links:
  - { from: [c0, c1, c2, c3, c4, c5, c6, c7], to: s0, type: b2b }

# Option D: two-tier Clos
links:
  - { from: [c0, c1, c2, c3], to: s0, type: b2b }
  - { from: [c4, c5, c6, c7], to: s1, type: b2b }
  - { from: [s0, s1], to: s2, type: r2r }

Relationship to the Old Configuration

The old configuration's interconnect, switch_config, and connections are all deprecated and replaced by network.

Old field	New location
`interconnect.links.c2c/b2b/r2r/p2p`	`network.link_types`
`interconnect.comm_params.bandwidth_utilization`	`network.routing.bandwidth_utilization`
`interconnect.comm_params.switch_latency_us`	`forwarding_latency_ns` on each switch (note the unit change from µs to ns)
`interconnect.comm_params.memory_`, `noc_`	chip preset YAML (chip-level attributes)
`switch_config`	`network.switches` + `network.links`
`connections`	`network.links`

Node ID Specification

Node type	ID format	Example
chip	`c{N}`	`c0`, `c1`, `c31`
switch	user-defined short name	`s0`, `tor0`, `spine0`

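Chip IDs are determined by the pods expansion order, and the system maintains a mapping table. For the example configuration above (4 chips per board, 1 board per rack):

c0  → pod_0/rack_0/board_0/chip_0
c1  → pod_0/rack_0/board_0/chip_1
...
c4  → pod_0/rack_1/board_0/chip_0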

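Parsing and Validation Flow

1. Load the YAML
2. Check that the network section exists → if it does not, raise ValueError
3. Expand pods → assign chip short names c0..cN in order
4. Build the chip nodes
5. Build the switch nodes (from network.switches)
6. Expand links:
   a. Parse from/to → generate edges from the Cartesian product
   b. Look up type → read the parameters from link_types
   c. Exclude self-loops (from == to)
7. Validate:
   a. Actual connection count of each switch <= port_count
   b. Every type reference exists in link_types
   c. No isolated chip nodes
   d. All chips are reachable (BFS connectivity check)
   e. Every ID in from/to is a known node
8. Output the NetworkGraph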
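
The structural checks in step 7 can be sketched as below. The function name validate_topology and its simplified inputs (a set of chip IDs, a switch-to-port_count mapping, and already expanded edge pairs) are assumptions for illustration; the sketch also assumes one port per distinct neighbor and collects messages in a list, matching the list[str] return type of validate_network_graph, leaving it to the caller to raise ValueError.

```python
from collections import deque


def validate_topology(
    chips: set[str],
    switch_ports: dict[str, int],
    edges: list[tuple[str, str]],
) -> list[str]:
    """Return a list of validation error messages (empty means valid)."""
    errors: list[str] = []
    known = set(chips) | set(switch_ports)
    neighbors: dict[str, set[str]] = {node: set() for node in known}
    if not chips:
        errors.append("configuration contains no chips")
    for src, dst in edges:
        unknown = [n for n in (src, dst) if n not in known]
        if unknown:
            errors.append(f"link {src}-{dst} references unknown node(s): {unknown}")
            continue
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    # 7a: a switch may use at most port_count ports (equality is valid).
    for switch, port_count in switch_ports.items():
        used = len(neighbors[switch])
        if used > port_count:
            errors.append(f"switch {switch} uses {used} ports but port_count is {port_count}")
    # 7c: no chip may be left without any edge.
    for chip in sorted(c for c in chips if not neighbors[c]):
        errors.append(f"chip {chip} is isolated")
    # 7d: BFS from one chip must reach every other chip.
    if chips:
        start = min(chips)
        visited = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
        unreachable = sorted(set(chips) - visited)
        if unreachable:
            errors.append(f"chips unreachable from {start}: {unreachable}")
    return errors
```

In this sketch an isolated chip is reported twice, once as isolated and once as unreachable, which keeps the two checks independent at the cost of some redundancy.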

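Boundary Conditions

Case	Behavior
0 chips	raise ValueError
switch port count = actual connection count	valid
switch port count < actual connection count	raise ValueError
link references a nonexistent node	raise ValueError
link references a nonexistent link_type	raise ValueError
chip is isolated (no edges at all)	raise ValueError
route request with src == dst	return an empty path
no reachable path	return None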

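Routing Engine

Phase 1: Dijkstra Shortest Path

Weight: latency_us (link latency + switch forwarding latency)
Also computes bottleneck_bw_gbps (the minimum bandwidth along the path)
Output is flagged with single_flow_model: true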
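
A minimal sketch of the Phase 1 computation, under stated assumptions: the graph is passed as a plain adjacency mapping of (neighbor, link latency, bandwidth) tuples rather than a NetworkGraph, the forwarding latency of a switch is charged on the hop that enters it, and a src == dst request yields a zero-hop route with infinite bottleneck bandwidth. The Route class is a trimmed stand-in for RoutePath, not the real type.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    nodes: list[str]
    hop_latencies_us: list[float]
    hop_bandwidths_gbps: list[float]
    total_latency_us: float
    bottleneck_bw_gbps: float
    single_flow_model: bool = True


def dijkstra_route(
    adjacency: dict[str, list[tuple[str, float, float]]],
    forwarding_us: dict[str, float],
    src: str,
    dst: str,
) -> Optional[Route]:
    """Lowest-latency path from src to dst, or None if dst is unreachable."""
    if src == dst:
        return Route([src], [], [], 0.0, float("inf"))  # assumed "empty path"
    dist = {src: 0.0}
    prev: dict[str, tuple[str, float, float]] = {}  # node -> (parent, hop latency, hop bw)
    heap = [(0.0, src)]
    done: set[str] = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue  # stale heap entry
        done.add(node)
        if node == dst:
            break
        for nxt, link_latency, bandwidth in adjacency.get(node, []):
            # Hop cost = link latency + forwarding latency of the node entered.
            hop = link_latency + forwarding_us.get(nxt, 0.0)
            candidate = d + hop
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                prev[nxt] = (node, hop, bandwidth)
                heapq.heappush(heap, (candidate, nxt))
    if dst not in prev:
        return None
    # Walk the predecessor chain back to src, then reverse.
    nodes, latencies, bandwidths = [dst], [], []
    current = dst
    while current != src:
        parent, hop, bandwidth = prev[current]
        nodes.append(parent)
        latencies.append(hop)
        bandwidths.append(bandwidth)
        current = parent
    nodes.reverse()
    latencies.reverse()
    bandwidths.reverse()
    return Route(nodes, latencies, bandwidths, sum(latencies), min(bandwidths))
```

For the Clos example, c0 to c4 goes c0, s0, s2, s1, c4, and under this convention the latency is 1.0 + 0.15 + 3.0 + 0.3 + 3.0 + 0.15 + 1.0 = 8.6 µs with a 400 Gbps bottleneck. How bandwidth_utilization is applied on top of that figure is not specified here, so the sketch leaves it out.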

Phase 2: Interface Abstraction + ECMP

route() dispatches internally on algorithm; the interface does not change
Adds an ecmp algorithm
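
One way to keep the route() interface stable while adding algorithms is a name-to-function registry. This is only a sketch of that idea: the register decorator, the _ALGORITHMS table, and the loosely typed graph/result arguments are all hypothetical, and the default of "shortest_path" simply mirrors routing.algorithm in the example configuration.

```python
from typing import Any, Callable, Optional

RouteFn = Callable[[Any, str, str], Optional[Any]]

# Hypothetical registry: algorithm name -> routing function.
_ALGORITHMS: dict[str, RouteFn] = {}


def register(name: str) -> Callable[[RouteFn], RouteFn]:
    """Decorator that registers a routing function under an algorithm name."""
    def decorator(fn: RouteFn) -> RouteFn:
        if name in _ALGORITHMS:
            raise ValueError(f"routing algorithm already registered: {name}")
        _ALGORITHMS[name] = fn
        return fn
    return decorator


def route(graph: Any, src: str, dst: str, algorithm: str = "shortest_path") -> Optional[Any]:
    """Dispatch on the algorithm name; callers never see which one ran."""
    try:
        fn = _ALGORITHMS[algorithm]
    except KeyError:
        # Unknown names are rejected explicitly instead of falling back silently.
        raise ValueError(f"unknown routing algorithm: {algorithm}") from None
    return fn(graph, src, dst)
```

Under this arrangement the Phase 1 Dijkstra function would be registered as "shortest_path" and a later implementation as "ecmp", with callers of route() unchanged.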

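Phase 3: Advanced Routing

Congestion-aware routing
Elephant/mouse flow-splitting policy

Frontend Integration

Phase 1 (Minimal Integration)

Add the NetworkGraphNode/NetworkGraphEdge/RoutePath types
Build the GraphModel from the YAML network section
Show switch nodes in the 2D topology view
Call the queryRoute() API and display the resulting path

Phase 2 (Rendering Decoupling + switch_config Migration)

Migrate the network semantics of switch_config into the network section
Demote switch_config to a pure rendering configuration (display subfields: position, u_height, etc.)
When the frontend UI edits a switch or link, it operates directly on the network section

Phase 3 (Optional)

Switch to React Flow / G6 for topology rendering

Frontend/Backend Synchronization Principles

Storage source: YAML (persisted by the backend)
Runtime source: GraphModel (consumed by the frontend)
Saving frontend graph edits: switches → network.switches, edges → network.links

Acceptance Criteria

Load the Clos example configuration → the graph is built successfully
Any chip pair → a correct RoutePath is output
Switch port limit exceeded → an error blocks the load
Link references a nonexistent node/type → error
The frontend can display switch nodes and hop-by-hop paths
Each topology option (full mesh/ring/star/Clos) can be realized by modifying links alone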
