Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be propagated correctly through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations and can express the complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations (concatenation, coalescence, composition, complementation, division, tiling, and inversion) that supports layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a single framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of the tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts, including CuTe DSL.
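To make the idea of a hierarchical layout concrete, the following is a minimal Python sketch, not CuTe's actual API. It assumes a layout is a pair of congruent (possibly nested) shape and stride tuples, and that a one-dimensional index is unpacked with the leftmost mode varying fastest. The helper names `size` and `offset` are hypothetical and chosen only for illustration.

```python
def size(shape):
    """Number of coordinates in a (possibly nested) shape."""
    if isinstance(shape, int):
        return shape
    n = 1
    for s in shape:
        n *= size(s)
    return n


def offset(index, shape, stride):
    """Map a 1-D index to a memory offset.

    Assumption of this sketch: the index is unpacked mode by mode with the
    leftmost mode varying fastest, and each leaf contributes
    coordinate * stride to the offset.
    """
    if isinstance(shape, int):
        return index * stride
    total = 0
    for s, d in zip(shape, stride):
        n = size(s)
        total += offset(index % n, s, d)  # coordinate within this mode
        index //= n                       # remainder goes to later modes
    return total


# Flat layout: a 4x2 column-major matrix (shape (4, 2), stride (1, 4)).
flat = [offset(i, (4, 2), (1, 4)) for i in range(8)]

# Hierarchical layout: the second mode is itself a (2, 2) shape with
# strides (1, 8), which no single flat stride for that mode can express.
hier = [offset(i, (4, (2, 2)), (2, (1, 8))) for i in range(16)]
```

In this sketch the flat layout visits offsets 0 through 7 in order, while the hierarchical layout interleaves its columns in memory yet still covers each of the 16 offsets exactly once. Nested modes add this kind of mapping on top of the flat shape-and-stride representation.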