The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society's ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation's Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed in inverse proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.
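To illustrate the general idea behind the approach described above, the following is a minimal sketch (not the paper's implementation) of constant-cost-per-token attention via a truncated Taylor expansion of the exponentiated query-key dot product. It uses a degree-2 polynomial feature map `phi` such that `phi(q) @ phi(k)` approximates `exp(q @ k)`, so causal softmax attention reduces to running sums that are updated once per token. All names and the choice of truncation degree are ours, for illustration only; the paper's formulation is exact to arbitrary precision and uses a minimal (symmetry-reduced) feature basis rather than the redundant one below.

```python
import numpy as np

def phi(x):
    """Degree-2 Taylor feature map: phi(q) @ phi(k) == 1 + q.k + (q.k)^2 / 2,
    a second-order approximation of exp(q.k). Illustrative only: this basis
    is redundant (it does not exploit the symmetry of the tensor products)."""
    return np.concatenate([[1.0], x, np.outer(x, x).ravel() / np.sqrt(2.0)])

def streaming_attention(queries, keys, values):
    """Causal attention with constant cost per token: instead of attending
    over all past keys, maintain running sums of phi(k_s) v_s^T and phi(k_s),
    then read them out with phi(q_t) at each step."""
    d_head = keys.shape[1]
    d_feat = 1 + d_head + d_head**2
    S = np.zeros((d_feat, values.shape[1]))  # running sum of phi(k_s) v_s^T
    z = np.zeros(d_feat)                     # running sum of phi(k_s)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        S += np.outer(fk, v)
        z += fk
        fq = phi(q)
        outputs.append(fq @ S / (fq @ z))    # numerator / normalizer
    return np.array(outputs)
```

Per token, the update touches only the fixed-size state `(S, z)`, so memory and compute do not grow with context length; accuracy improves as further Taylor terms (higher-degree feature maps) are included.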