Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap $\Delta$ certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as $\exp(-\Omega(\Delta/\varepsilon))$, while the error on the active face scales linearly in the temperature/regularization parameter $\varepsilon$. This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob for trading accuracy against compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small per-query candidate set through a paging-style context-selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.
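To make the stated bound concrete, here is a minimal sketch in the simplest unconstrained case, using notation introduced only for illustration: a query $q$, keys $k_1,\dots,k_n$, scores $s_i = \langle q, k_i \rangle$, and an active set $S$. The entropic (softmax) weights and the score margin are
\[
  p_i(\varepsilon) = \frac{\exp(s_i/\varepsilon)}{\sum_{\ell=1}^{n} \exp(s_\ell/\varepsilon)},
  \qquad
  \Delta = \min_{i \in S} s_i \;-\; \max_{j \notin S} s_j,
\]
where $\Delta$ plays the role of the KKT-certified support gap in this unconstrained setting. Since $p_j(\varepsilon) \le \exp\bigl((s_j - \max_i s_i)/\varepsilon\bigr) \le e^{-\Delta/\varepsilon}$ for every $j \notin S$,
\[
  \sum_{j \notin S} p_j(\varepsilon) \;\le\; (n - |S|)\, e^{-\Delta/\varepsilon} = \exp\bigl(-\Omega(\Delta/\varepsilon)\bigr),
\]
which matches the exponential decay of inactive mass claimed above, while the weights restricted to $S$ approach the limiting projection weights with error $O(\varepsilon)$.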