As an alternative to the monolithic pairwise attention of conventional Transformer models, there is growing interest in sparse interactions that align more closely with biological principles. Approaches such as the Set Transformer and the Perceiver combine cross-attention with a latent space that acts as a limited-capacity attention bottleneck. Building on recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces a low-rank explicit memory that serves both as priors that guide bottleneck attention in the shared workspace and as attractors within the associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner: it learns distinct priors through bottlenecks whose complexity is invariant to input quantities and dimensions. AiT outperforms methods such as the Set Transformer, Vision Transformer, and Coordination on various vision tasks.
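To make the two roles of the memory concrete, the following is a minimal sketch of the general idea, not the AiT implementation. It assumes a small set of latent vectors (`priors`) that attend over the input tokens, a hypothetical `top_k` rule that lets only the highest-scoring tokens write into each memory slot, and a single softmax-based retrieval step in the style of a continuous Hopfield update. All function names, the `top_k` rule, and the `beta` parameter are illustrative assumptions.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a new vector."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]

def bottleneck_write(priors, tokens, top_k):
    """Each prior attends to the tokens; only its top_k tokens may write.

    The top_k masking is an assumed stand-in for the competition among
    inputs described in the text.
    """
    scale = math.sqrt(len(priors[0]))
    memory = []
    for prior in priors:
        scores = [dot(prior, tok) / scale for tok in tokens]
        # Competition: keep the top_k highest-scoring tokens, drop the rest.
        keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:top_k]
        kept_weights = softmax([scores[i] for i in keep])
        memory.append(weighted_sum(kept_weights, [tokens[i] for i in keep]))
    return memory

def hopfield_retrieve(tokens, memory, beta=1.0):
    """One softmax retrieval step: each token moves toward stored patterns."""
    retrieved = []
    for tok in tokens:
        weights = softmax([beta * dot(tok, slot) for slot in memory])
        retrieved.append(weighted_sum(weights, memory))
    return retrieved

random.seed(0)
dim, n_tokens, n_priors, top_k = 8, 16, 4, 3
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
priors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_priors)]

memory = bottleneck_write(priors, tokens, top_k)
retrieved = hopfield_retrieve(tokens, memory, beta=2.0)
print(len(memory), len(memory[0]))        # memory size is set by n_priors, not n_tokens
print(len(retrieved), len(retrieved[0]))  # one retrieved vector per input token
```

In this sketch the memory has `n_priors` slots regardless of how many tokens arrive, which is the sense in which a bottleneck's size does not grow with the input. In a trained model the priors would be learned parameters rather than random vectors.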