Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to design and implement tensor layouts -- mappings between logical tensors and hardware resources -- in a way that is both flexible and performant. The increasing complexity of DL algorithms and hardware demands a generic, systematic approach to handling tensor layouts. In this work, we introduce Linear Layouts, a novel approach that models tensor layouts using linear algebra over $\mathbb{F}_2$. By representing a tensor layout as a binary matrix acting on the bits of the hardware representation, our approach enables a generic layout definition -- as opposed to the classical case-by-case approach -- and supports generic layout-to-layout conversions, eliminating the quadratic explosion of conversion paths that plagues existing solutions. We integrate linear layouts into Triton and demonstrate their effectiveness in optimizing both individual Triton operators and kernels written in Triton. We also show that linear layouts reduce engineering effort in the compiler backend while fixing several bugs in Triton's legacy layout system.
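The core idea can be sketched in a few lines of NumPy. This is an illustrative toy, not Triton's actual implementation: the `apply_layout` helper and the particular swizzle matrix below are hypothetical, chosen only to show how a layout becomes a binary matrix over $\mathbb{F}_2$ and how layout-to-layout conversion stays within matrix algebra.

```python
import numpy as np

def apply_layout(M, index, nbits):
    """Apply a layout, given as a binary matrix M over F_2, to an index.

    The index is decomposed into bits (LSB first), mapped by M modulo 2,
    and recomposed into an integer hardware index.
    """
    bits = np.array([(index >> i) & 1 for i in range(nbits)], dtype=np.uint8)
    out = (M @ bits) % 2  # linear map over F_2
    return int(sum(int(b) << i for i, b in enumerate(out)))

nbits = 4
identity = np.eye(nbits, dtype=np.uint8)  # the trivial layout

# A hypothetical "swizzled" layout: output bit 0 additionally XORs in input bit 2.
swizzle = identity.copy()
swizzle[0, 2] = 1

# Index 4 (bits 0,0,1,0) maps to 5 (bits 1,0,1,0) under the swizzle.
print(apply_layout(swizzle, 4, nbits))

# Because layouts are linear maps, converting from layout A to layout B is
# itself a single linear map, B @ A^{-1} (mod 2), rather than a hand-written
# routine per pair. This swizzle is an involution, so it is its own inverse.
assert np.array_equal((swizzle @ swizzle) % 2, identity)
```

Composing and inverting layouts thus never leaves binary matrix algebra, which is what replaces the quadratic number of pairwise conversion routines with a single generic mechanism.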