Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures, which, despite their effectiveness, face a computational bottleneck due to the quadratic complexity of self-attention. This inefficiency stems largely from the self-attention heads capturing redundant token interactions, reflecting inherent redundancy within visual data. Many works have aimed to reduce the computational complexity of self-attention in ViTs, leading to efficient and sparse transformer architectures. Viewed through the efficiency lens, almost any sparse self-attention strategy keeps the computational overhead of ViTs low; however, such strategies are often sub-optimal because they fail to capture fine-grained visual details. This observation leads us to propose a general, efficient, sparse architecture, named Fibottention, for approximating self-attention with superlinear complexity, built upon Fibonacci sequences. The key strategies in Fibottention are: excluding proximate tokens to reduce redundancy, employing structured sparsity by design to decrease computational demands, and incorporating inception-like diversity across attention heads. This diversity ensures that complementary information is captured through non-overlapping token interactions, optimizing both performance and resource utilization in ViTs for visual representation learning. We embed our Fibottention mechanism into multiple state-of-the-art transformer architectures dedicated to visual tasks. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention, in conjunction with ViT and its variants, consistently achieves significant performance boosts over standard ViTs on nine datasets across three domains: image classification, video understanding, and robot learning.
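To make the three key strategies concrete, the following is a minimal sketch of how a Fibonacci-based sparse attention mask could be constructed: each head attends only at Fibonacci-spaced token offsets, proximate tokens (offset below a threshold) are excluded, and each head uses a different starting pair so that heads cover complementary interactions. The function names, the head-specific starting pairs, and the `min_offset` threshold are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def fibonacci_offsets(max_offset, a, b):
    # Fibonacci-like sequence of token offsets starting from the pair (a, b).
    offs = []
    while a <= max_offset:
        offs.append(a)
        a, b = b, a + b
    return offs

def fibottention_mask(num_tokens, num_heads, min_offset=2):
    # One boolean mask per head; True marks a retained query-key interaction.
    masks = np.zeros((num_heads, num_tokens, num_tokens), dtype=bool)
    for h in range(num_heads):
        # Head-specific starting pair (illustrative choice) gives each head
        # a different, largely complementary set of Fibonacci offsets.
        offs = [o for o in fibonacci_offsets(num_tokens - 1, h + 1, h + 2)
                if o >= min_offset]  # exclude proximate tokens
        for i in range(num_tokens):
            masks[h, i, i] = True  # keep self-interaction
            for o in offs:
                if i - o >= 0:
                    masks[h, i, i - o] = True
                if i + o < num_tokens:
                    masks[h, i, i + o] = True
    return masks
```

Because a Fibonacci sequence reaches `n` in O(log n) steps, each query row keeps only O(log n) keys, so the number of retained interactions grows as O(n log n) rather than O(n^2), matching the superlinear-complexity claim. The resulting boolean masks can be applied to the attention logits (e.g. by setting masked-out entries to negative infinity) before the softmax.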