Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward passes of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
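To make the GEMM-plus-epilogue idea concrete, the following is a minimal CPU sketch in plain Python, not CODA's actual interface. It assumes one illustrative reparameterization: because per-row scaling commutes with matrix multiplication, an RMSNorm (without learned gain) followed by a linear layer and a residual add can be computed as a plain GEMM on the unnormalized input, with the row scaling and the residual accumulation applied to each output tile before it is written. The names `gemm_with_epilogue`, `rmsnorm_then_linear_unfused`, and `rmsnorm_then_linear_fused` are hypothetical, and the per-tile dictionary merely stands in for on-chip storage.

```python
import math


def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled GEMM with a fixed mainloop; only `epilogue` varies per operator."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Mainloop: accumulate one output tile (stand-in for on-chip storage).
            acc = {(i, j): sum(a[i][p] * b[p][j] for p in range(k))
                   for i in rows for j in cols}
            # Epilogue: transform each element before its single write to `out`.
            for (i, j), value in acc.items():
                out[i][j] = epilogue(i, j, value)
    return out


def _inv_rms(row, eps):
    # Per-row reduction: 1 / sqrt(mean(x^2) + eps).
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def rmsnorm_then_linear_unfused(x, w, residual, eps=1e-6):
    """Reference path: three separate steps, each materializing a full tensor."""
    normed = [[v * _inv_rms(row, eps) for v in row] for row in x]
    y = gemm_with_epilogue(normed, w, lambda i, j, v: v)  # identity epilogue
    return [[y[i][j] + residual[i][j] for j in range(len(y[0]))]
            for i in range(len(y))]


def rmsnorm_then_linear_fused(x, w, residual, eps=1e-6):
    """Reparameterized path: (x / rms) @ w + r == (x @ w) * (1 / rms) + r."""
    inv_rms = [_inv_rms(row, eps) for row in x]  # one scalar per row
    # Scaling and accumulation happen in the epilogue, on the output tile.
    return gemm_with_epilogue(
        x, w, lambda i, j, v: v * inv_rms[i] + residual[i][j])
```

Both functions should agree up to floating-point rounding for any conforming `x`, `w`, and `residual`. The fused variant never materializes the normalized activations or the pre-residual output, which is the kind of data-movement saving the abstract describes. In this sketch the per-row statistics are still computed in a separate pass over `x`; a real kernel would presumably obtain them differently, for instance from a reduction in an earlier epilogue.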