Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs memory traffic quadratic in the sequence length~$n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis based solely on FLOP counts mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n d_k + n d_v)$ data movement versus $O(n^2 + n d_k + n d_v)$ for the standard implementation, where $n$ is the sequence length, $d_k$ the key dimensionality and $d_v$ the value dimensionality, and it is verified numerically against PyTorch to full double-precision accuracy on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through Operational Normal Form (ONF) to dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
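To make the memory claim concrete, the following is a minimal NumPy sketch, not the MoA derivation itself. It contrasts a reference attention that materialises the $n \times n$ score matrix with a row-wise streaming variant that keeps only a running maximum, a running denominator and a length-$d_v$ accumulator. The streaming variant uses the well-known online-softmax rescaling as a stand-in for the DNF; the function names `attention_reference` and `attention_streaming` are hypothetical, and the code illustrates numerical equivalence only, not measured data movement.

```python
import numpy as np

def attention_reference(Q, K, V):
    """Standard attention: materialises the full n x n score matrix."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)               # n x n intermediate
    S = S - S.max(axis=-1, keepdims=True)    # shift for a stable softmax
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attention_streaming(Q, K, V):
    """Row-wise attention with no n x n array and no transposed-K buffer.

    Assumption: online-softmax rescaling stands in for the paper's DNF.
    """
    n, d_k = Q.shape
    d_v = V.shape[1]
    scale = 1.0 / np.sqrt(d_k)
    out = np.empty((n, d_v), dtype=np.float64)
    for i in range(n):
        m = -np.inf            # running maximum of the scores in row i
        denom = 0.0            # running softmax denominator
        acc = np.zeros(d_v)    # running weighted sum of value rows
        for j in range(K.shape[0]):
            s = scale * np.dot(Q[i], K[j])   # one score, never stored
            m_new = max(m, s)
            correction = np.exp(m - m_new)   # rescales earlier terms; 0 on the first step
            w = np.exp(s - m_new)
            denom = denom * correction + w
            acc = acc * correction + w * V[j]
            m = m_new
        out[i] = acc / denom
    return out
```

Both functions read $Q$, $K$ and $V$ and write an $n \times d_v$ output; only the reference version allocates storage that grows as $n^2$.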
