Elastic Attention Cores for Scalable Vision Transformers

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with the number of image patches, and hence with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches interact directly only with a resolution-invariant set of $C$ learned "core" embeddings, the attention cost scales as $O(NC)$, which is linear in $N$ for a fixed $C$ and thus avoids quadratic scaling. Unlike prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy at inference time. Across classification and dense prediction tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
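To make the core-periphery pattern concrete, the following is a minimal NumPy sketch of one possible layer, not the VECA implementation. The abstract does not specify the layer design, so several choices here are assumptions made for illustration: the two-step ordering (cores read from patches, then patches read from cores), the single-head attention without learned projections, the residual updates, and the names `cross_attend`, `core_periphery_layer`, and `num_active_cores`. Elastic inference is modeled, again as an assumption, by using only the first `num_active_cores` cores.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys_values):
    # Single-head scaled dot-product cross-attention.
    # Learned Q/K/V projections are omitted to keep the sketch short (assumption).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # shape (num_queries, num_keys)
    return softmax(scores, axis=-1) @ keys_values   # shape (num_queries, d)


def core_periphery_layer(patches, cores, num_active_cores=None):
    # patches: (N, d) patch tokens; cores: (C, d) core tokens.
    # Hypothetical elastic knob: keep only a prefix of the cores.
    c = cores.shape[0] if num_active_cores is None else num_active_cores
    active = cores[:c]

    # Step 1: cores gather information from all patches. Score matrix is (c, N).
    active = active + cross_attend(active, patches)

    # Step 2: patches read back from the cores only. Score matrix is (N, c).
    # No (N, N) patch-to-patch matrix is ever formed.
    patches = patches + cross_attend(patches, active)
    return patches, active


rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 32))   # e.g. a 14 x 14 patch grid, d = 32
cores = rng.standard_normal((16, 32))      # C = 16 cores, random here for illustration
new_patches, new_cores = core_periphery_layer(patches, cores, num_active_cores=4)
print(new_patches.shape, new_cores.shape)  # (196, 32) (4, 32)
```

Both score matrices have $N \cdot c$ entries, so the attention cost in this sketch grows linearly with the number of patches for a fixed number of cores, and lowering `num_active_cores` reduces it proportionally. All $N$ patch tokens are retained and updated, in line with the abstract's contrast with architectures that compress the input into a $C$-way bottleneck.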
