The quadratic cost of softmax attention limits Transformer scalability in high-resolution vision. We introduce Infinite Self-Attention (InfSA), a spectral reformulation that treats each attention layer as a diffusion step on a content-adaptive token graph, accumulating multi-hop interactions through a discounted Neumann series over attention matrices. This links self-attention to classical graph centrality (Katz, PageRank, eigenvector centrality) for interpretable token weighting. We also show that the Neumann kernel equals the fundamental matrix of an absorbing Markov chain, so a token's centrality is its expected number of random-walk visits before absorption. We then propose Linear-InfSA, a linear-time variant that approximates the principal eigenvector of the implicit attention operator without forming the full attention matrix. It keeps an auxiliary state of fixed size proportional to the per-head dimension d_h (independent of the sequence length N), is drop-in compatible with Vision Transformers, and supports stable training at 4096×4096 and inference at 9216×9216 (about 332k tokens). In a 4-layer ViT (53.5M parameters, 59 GFLOPs at 224×224), Linear-InfSA reaches 84.7% top-1 on ImageNet-1K, a +3.2-point architectural gain over an equal-depth softmax ViT trained with the same recipe. On ImageNet-V2, InfViT variants outperform all compared baselines (up to 79.8% vs. 76.8%), indicating robustness under distribution shift. On an A100 40GB GPU, Linear-InfViT runs at 231 images/s and 0.87 J/image (13× higher throughput and 13× lower energy per image than the equal-depth ViT) and is the only tested model to complete 9216×9216 inference without running out of memory. The linear approximation closely matches the dominant eigenvector of the quadratic operator (cosine similarity 0.985). Code available at: https://huggingface.co/groffo/infinite-self-attention or https://github.com/giorgioroffo/infinite-self-attention
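To make the two core ideas concrete, the following minimal NumPy sketch (not the released implementation; sizes, the discount alpha=0.9, and the elu+1 feature map are illustrative assumptions) shows (i) how the discounted Neumann series over a row-stochastic attention matrix equals the fundamental matrix (I - alpha*A)^{-1} of an absorbing Markov chain, yielding Katz-style token centrality, and (ii) how a principal eigenvector of an implicit linearized attention operator can be approximated matrix-free in O(N·d_h) per step.

```python
# Illustrative sketch only (assumed details, not the authors' code).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d = 256, 32                                 # tokens, per-head dimension (toy sizes)
Q = rng.standard_normal((N, d)) / d ** 0.25
K = rng.standard_normal((N, d)) / d ** 0.25

# (i) Quadratic view: row-stochastic attention as a random-walk transition matrix.
A = softmax(Q @ K.T / np.sqrt(d))              # N x N attention matrix
alpha = 0.9                                    # discount < 1 ensures the series converges
F = np.linalg.inv(np.eye(N) - alpha * A)       # fundamental matrix = sum_k (alpha * A)^k
katz_centrality = F.sum(axis=0)                # expected visits to each token before absorption

# (ii) Linear view: kernelized attention applied without forming the N x N matrix.
def phi(x):
    """Positive feature map (elu + 1), as commonly used in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

Qf, Kf = phi(Q), phi(K)                        # N x d feature maps

def apply_attention(v):
    """Compute A_lin @ v in O(N*d) via the factorization A_lin = D^{-1} Qf Kf^T."""
    num = Qf @ (Kf.T @ v)                      # never materializes the attention matrix
    den = Qf @ Kf.sum(axis=0)                  # row normalizer D
    return num / den

# Power iteration for the principal eigenvector of the implicit linear operator.
v = np.ones(N) / np.sqrt(N)
for _ in range(50):
    v = apply_attention(v)
    v /= np.linalg.norm(v)

# Sanity check against the dominant eigenvector of the explicit linearized matrix.
A_lin = (Qf @ Kf.T) / (Qf @ Kf.sum(axis=0))[:, None]
w, V = np.linalg.eig(A_lin)
u = np.real(V[:, np.argmax(np.abs(w))])
u *= np.sign(u @ v)                            # align signs before comparing
print("cosine(power iteration, dominant eigenvector):", float(v @ u / np.linalg.norm(u)))
```

In this sketch, only the d-dimensional summaries Kf.T @ v and Kf.sum(axis=0) are carried between steps, which mirrors the abstract's claim of an auxiliary state whose size depends on d_h rather than N.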