Towards Tight Bounds for Streaming Attention

The attention mechanism is a cornerstone of modern transformer architectures. However, its expressive power comes at the cost of quadratic runtime and linear space usage. In particular, the classical transformer architecture explicitly stores all previously seen input elements (tokens) in order to generate the next one. The problem of implementing a transformer in limited space, known as KV cache compression, has received much interest over the past few years, spurring the development of powerful heuristics. Recent works of Haris et al. (COLT'25) and Kochetkova et al. (NeurIPS'25) formalized KV cache compression as the streaming attention approximation problem, providing both upper bounds (based on discrepancy theory) and information-theoretic lower bounds. However, those papers left open a significant gap between the upper and lower bounds. For example, the space usage of their algorithms increases with the precision parameter, but their lower bound does not get correspondingly stronger. In this work, we revisit the streaming attention approximation problem and provide nearly tight bounds on its space complexity. On the algorithmic side, we achieve the result through a surprisingly tight interplay between three distinct methods for kernel density estimation: discrepancy-based coreset constructions (e.g., Charikar-Kapralov-Waingarten'24), the polynomial method (e.g., Greengard-Rokhlin'87, Alman-Song'23), and space partitioning (e.g., Andoni-Laarhoven-Razenshteyn-Waingarten'17, Charikar-Kapralov-Nouri-Siminelakis'20). On the lower bound side, our main technical contribution is a new technique for using the INDEX problem with a large amount of side information, which we hope will prove useful in other high-dimensional geometric estimation problems.
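
To make the space cost concrete, the following is a minimal sketch of the exact baseline that KV cache compression aims to beat: a single attention head that stores every key/value pair from the stream and answers a query with a softmax-weighted average of the values. The class name `ExactStreamingAttention` and its methods are hypothetical, chosen for illustration only; the sketch assumes unscaled dot-product scores (no 1/sqrt(d) factor) and is not the algorithm of this work or of the cited papers.

```python
import math


class ExactStreamingAttention:
    """Hypothetical exact baseline: keeps every (key, value) pair it sees."""

    def __init__(self):
        self.keys = []    # one key vector per streamed token
        self.values = []  # one value vector per streamed token

    def append(self, key, value):
        # Storing each pair is what makes the space usage linear in the stream length.
        self.keys.append(list(key))
        self.values.append(list(value))

    def query(self, q):
        if not self.keys:
            raise ValueError("no tokens have been streamed yet")
        # Assumption: unscaled dot-product scores <q, k_i>.
        scores = [sum(a * b for a, b in zip(q, k)) for k in self.keys]
        # Subtracting the maximum score keeps exp() numerically stable.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        # Softmax-weighted average of the stored values, one pass over the cache.
        return [
            sum(w * v[j] for w, v in zip(weights, self.values)) / total
            for j in range(dim)
        ]

    def cache_size(self):
        return len(self.keys)
```

Each query scans the whole cache, so answering one query per token takes time quadratic in the number of tokens, while the cache itself grows linearly. A streaming approximation algorithm would replace the two lists with a smaller summary and return an approximate answer from `query`.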
