Scaling modern deep learning workloads demands coordinated placement of data and compute across device meshes, memory hierarchies, and heterogeneous accelerators. We present Axe Layout, a hardware-aware abstraction that maps logical tensor coordinates to a multi-axis physical space via named axes. Axe unifies tiling, sharding, replication, and offsets across inter-device distribution and on-device layouts, enabling collective primitives to be expressed consistently from device meshes down to threads. Building on Axe, we design a multi-granularity, distribution-aware DSL and compiler that composes thread-local control with collective operators in a single kernel. Experiments show that our unified approach achieves performance close to hand-tuned kernels across the latest GPUs, multi-device environments, and other accelerator backends.
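To make the core idea concrete, the following is a minimal sketch of a named-axis layout that decomposes a logical tensor coordinate into per-axis physical coordinates (device, thread, element). The class name, axis names, and API are illustrative assumptions, not the actual Axe interface:

```python
# Hypothetical sketch of a named-axis layout; NOT the real Axe Layout API.
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedAxisLayout:
    """Maps a logical 1-D index to named physical axes.

    axes: ordered (name, size) pairs, with the fastest-varying axis last.
    The same mechanism can express sharding (a "device" axis), on-chip
    distribution (a "thread" axis), and local storage (an "elem" axis).
    """
    axes: tuple  # e.g. (("device", 4), ("thread", 8), ("elem", 16))

    def to_physical(self, logical_index: int) -> dict:
        """Decompose a logical index into a coordinate per named axis."""
        coords = {}
        for name, size in reversed(self.axes):
            coords[name] = logical_index % size
            logical_index //= size
        return coords


# A 512-element logical vector spread over 4 devices x 8 threads x 16 elems.
layout = NamedAxisLayout(axes=(("device", 4), ("thread", 8), ("elem", 16)))
print(layout.to_physical(300))  # {'elem': 12, 'thread': 2, 'device': 2}
```

Replicating along an axis (rather than sharding) would correspond to omitting that axis from the decomposition, so every coordinate on it holds the same data; this is the sense in which one abstraction covers tiling, sharding, and replication.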