Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes down to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach delivers performance close to that of hand-tuned kernels across the latest GPUs, multi-device environments, and accelerator backends.
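The core idea of mapping logical tensor coordinates onto named physical axes can be illustrated with a minimal sketch. The function below is a hypothetical illustration (not the actual Axe API): it decomposes a flat logical index into coordinates on an ordered list of named axes via mixed-radix decomposition, the basic mechanism underlying tiling and sharding across a device mesh, threads, and per-thread elements. The axis names `device`, `thread`, and `elem` are assumptions for the example.

```python
def decompose(index, axes):
    """Map a flat logical index to coordinates on named physical axes.

    axes: list of (name, size) pairs, outermost first. For example,
    [("device", 4), ("thread", 8), ("elem", 2)] shards a length-64
    tensor across 4 devices, with 8 threads per device and 2 elements
    per thread. This is a sketch of the idea, not the Axe Layout API.
    """
    coords = {}
    # Peel off axes from innermost to outermost, like digits in a
    # mixed-radix number system.
    for name, size in reversed(axes):
        coords[name] = index % size
        index //= size
    return coords

# Logical element 13 lands on device 0, thread 6, element slot 1
# (since 0*16 + 6*2 + 1 = 13).
print(decompose(13, [("device", 4), ("thread", 8), ("elem", 2)]))
```

Replication could be expressed in the same vocabulary by mapping a logical index to every coordinate of a replicated axis rather than a single one; the point of a unified layout abstraction is that such choices are all variations of one coordinate mapping.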