There is growing interest in moving beyond the monolithic pairwise attention of conventional Transformer models toward sparse interactions that align more closely with biological principles. Approaches such as the Set Transformer and the Perceiver combine cross-attention with a latent space that forms a limited-capacity attention bottleneck. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves both as priors that guide bottleneck attention in the shared workspace and as attractors within the associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck fosters competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner that learns distinct priors through bottlenecks whose complexity is invariant to the number and dimensionality of inputs. AiT outperforms methods such as the Set Transformer, Vision Transformer, and Coordination on various vision tasks.
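The two roles the abstract assigns to the learned priors can be illustrated with a minimal sketch. It is not the AiT implementation: the function names, the top-k selection rule, the inverse temperature `beta`, and the toy vectors are all assumptions made for illustration. The first function lets a small set of priors attend over the input tokens and keeps only the `k` tokens that receive the most attention (the competition imposed by a bottleneck). The second performs one softmax-weighted retrieval step in the style of a continuous Hopfield network, pulling a query toward the nearest stored prior.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bottleneck_attention(priors, tokens, k):
    """Hypothetical bottleneck: priors act as queries over the tokens,
    and only the k tokens with the highest total attention survive."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    # attn[m][n]: how strongly prior m attends to token n.
    attn = [softmax([scale * dot(p, t) for t in tokens]) for p in priors]
    # Competition: score each token by the attention it receives from all priors.
    scores = [sum(row[n] for row in attn) for n in range(len(tokens))]
    winners = sorted(range(len(tokens)), key=lambda n: scores[n], reverse=True)[:k]
    return sorted(winners)


def hopfield_retrieve(priors, query, beta=4.0):
    """One retrieval step (assumed form): a softmax-weighted sum of the
    stored patterns, so the priors behave as attractors."""
    weights = softmax([beta * dot(query, p) for p in priors])
    return [sum(w * p[d] for w, p in zip(weights, priors)) for d in range(len(query))]


# Toy data (made up): two priors and four input tokens in two dimensions.
priors = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[5.0, 0.0], [0.1, 0.1], [0.0, 5.0], [-1.0, -1.0]]

selected = bottleneck_attention(priors, tokens, k=2)
retrieved = hopfield_retrieve(priors, [0.9, 0.1])
```

In this toy setting the bottleneck passes only the two tokens aligned with a prior, and the retrieval step moves the query closer to the first prior. The cost of the selection depends on the number of priors and on `k`, which is one way to read the abstract's claim about bottlenecks whose complexity does not grow with the input.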