We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold, and that this manifold can be identified dynamically without scoring every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation; it is lossless parity. We validate this claim on GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching across more than 1,500 generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a measured 159x speedup at 131K tokens (3.94 ms vs. 628 ms) and a projected >1,200x speedup at 1M tokens, reducing inference cost by >99.9% relative to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not of intelligence.
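To make the Condensate Manifold construction concrete, the following is a minimal NumPy sketch of the Anchor + Window + Dynamic Top-k selection for a single query. The sizes `a`, `w`, and `k` are illustrative placeholders, not values from the paper, and the sketch only demonstrates mask construction and renormalized softmax over the selected positions; the paper's exact-parity claim concerns the case where the mask captures the full attention mass.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 16          # sequence length, head dimension (illustrative)
a, w, k = 4, 8, 8      # anchor, window, top-k sizes (illustrative, not from the paper)

q = rng.normal(size=d)          # one query vector
K = rng.normal(size=(n, d))     # key matrix
V = rng.normal(size=(n, d))     # value matrix

scores = K @ q / np.sqrt(d)     # scaled dot-product scores

# Full attention output, for reference
full = np.exp(scores - scores.max())
full /= full.sum()
out_full = full @ V

# Condensate mask: anchor prefix + recent window + dynamic top-k by score
mask = np.zeros(n, dtype=bool)
mask[:a] = True                        # Anchor: earliest positions
mask[-w:] = True                       # Window: most recent positions
mask[np.argsort(scores)[-k:]] = True   # Dynamic Top-k: highest-scoring keys

# Softmax restricted to the masked positions (renormalized over the subset)
sub = np.exp(scores[mask] - scores[mask].max())
sub /= sub.sum()
out_sparse = sub @ V[mask]

# Relative deviation from full attention for this random example
err = np.linalg.norm(out_full - out_sparse) / np.linalg.norm(out_full)
```

The masked softmax is, by construction, the full softmax renormalized over the selected subset, so the two agree exactly whenever the mask covers all positions carrying attention mass; the budget is at most `a + w + k` positions regardless of `n`.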