The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher-order interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any $n \times n$ matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: https://github.com/yifanzhang-pro/HLA.
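For intuition, here is a minimal sketch of the prefix-statistic idea (a generic illustration, not HLA's exact update rule; the symbols $S_t$, $T_t$, and the second key projection $k'_t$ are assumptions for exposition). First-order causal linear attention maintains a constant-size state
\[
S_t = S_{t-1} + k_t v_t^\top, \qquad y_t = S_t^\top q_t,
\]
while a second-order analogue can maintain a higher-order prefix statistic such as
\[
T_t = T_{t-1} + k_t \otimes k'_t \otimes v_t, \qquad y_t = T_t(q_t, q_t, \cdot),
\]
so the state size depends only on the model dimensions, the per-token cost is independent of $n$, and no $n \times n$ attention matrix is ever materialized.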