Video diffusion has quickly grown into a key generative serving workload, yet producing each clip demands many denoising iterations over large spatio-temporal latents, which puts low-latency inference out of reach on a single device. A denoising step is therefore typically distributed across multiple accelerators, and TPU sub-slices have become an attractive and practical fabric for doing so. Current auto-parallel systems, however, search almost exclusively over logical device meshes and disregard how a chosen sharding is actually laid out on the physical TPU interconnect, an oversight that leaves large, topology-dependent performance gains on the table. We address this gap with AoiZora, a compiler-mediated topology planner built for low-latency video diffusion inference on TPU sub-slices. Its guiding principle is to reconnect logical sharding with physical placement by drawing on different points in the compilation flow: AoiZora first eliminates weak sharding candidates using inexpensive pre-compilation IRs, then compiles only the survivors and ranks their physical placements using the compiled HLO together with a topology-aware communication model. The winning plan is realized along the ordinary compiler path, leaving model code, compiler lowering, collective kernels, and network routing entirely intact. On TPU v5e sub-slices, AoiZora reduces Wan 2.1 one-step denoising latency by up to 1.42x relative to existing solutions.
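To make the gap between logical sharding and physical placement concrete, the following is a minimal illustrative sketch, not AoiZora's actual cost model. It assumes a hypothetical 2x4 grid of chips without wraparound links, a 2x4 logical mesh, ring collectives along each logical axis, and a toy cost equal to the per-axis payload times the longest neighbor hop. The helper names (`hops`, `axis_cost`, `placement_cost`, `rank_placements`) and the per-axis byte counts are invented for illustration; in a real planner the communication volumes would come from the compiled program.

```python
def hops(p, q):
    """Manhattan distance between two chips on a grid with no wraparound (assumption)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def axis_cost(placement, shape, axis, nbytes):
    """Toy ring-collective cost along one logical axis.

    Cost = payload * longest neighbor hop over all rings on that axis.
    This is a stand-in for a topology-aware communication model.
    """
    a, b = shape
    worst = 0
    for fixed in range(b if axis == 0 else a):
        if axis == 0:
            ring = [placement[(i, fixed)] for i in range(a)]
        else:
            ring = [placement[(fixed, j)] for j in range(b)]
        n = len(ring)
        for k in range(n):
            worst = max(worst, hops(ring[k], ring[(k + 1) % n]))
    return worst * nbytes


def placement_cost(placement, shape, axis_bytes):
    """Sum the toy cost over all logical axes that carry traffic."""
    return sum(axis_cost(placement, shape, ax, nb) for ax, nb in axis_bytes.items())


def rank_placements(candidates, shape, axis_bytes):
    """Return candidate names ordered from cheapest to most expensive."""
    return sorted(candidates, key=lambda name: placement_cost(candidates[name], shape, axis_bytes))


SHAPE = (2, 4)  # logical mesh: axis 0 has 2 devices, axis 1 has 4

# Placement A: logical (i, j) sits on physical chip (i, j).
ROW_MAJOR = {(i, j): (i, j) for i in range(2) for j in range(4)}

# Placement B: each axis-1 ring is folded into a 2x2 block so every ring hop is 1.
BLOCK_RING = [(0, 0), (0, 1), (1, 1), (1, 0)]
BLOCKED = {
    (i, j): (BLOCK_RING[j][0], BLOCK_RING[j][1] + 2 * i)
    for i in range(2)
    for j in range(4)
}

CANDIDATES = {"row_major": ROW_MAJOR, "blocked": BLOCKED}

if __name__ == "__main__":
    # Hypothetical traffic: axis 1 carries 4 units per step, axis 0 carries 1.
    print(rank_placements(CANDIDATES, SHAPE, {0: 1, 1: 4}))
    # Reversed traffic pattern for the same logical mesh.
    print(rank_placements(CANDIDATES, SHAPE, {0: 4, 1: 1}))
```

Under these assumptions the same logical 2x4 mesh prefers the folded placement when axis 1 dominates the traffic and the row-major placement when axis 0 dominates, which is the kind of topology dependence that a search over logical meshes alone cannot observe.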