The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective Synchronization Attention (SSA), a novel attention mechanism that replaces the standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators. In SSA, each token is represented as an oscillator characterized by a learnable natural frequency and phase; the synchronization strength between token pairs, determined by a frequency-dependent coupling and phase-locking condition, serves as the attention weight. This formulation provides three key advantages: (i) natural sparsity arising from the phase-locking threshold, whereby tokens with incompatible frequencies automatically receive zero attention weight without explicit masking; (ii) unified positional-semantic encoding through the natural frequency spectrum, eliminating the need for separate positional encodings; and (iii) a single-pass, closed-form computation that avoids iterative ODE integration, with all components (coupling, order parameter, synchronization) derived from the oscillatory framework. We instantiate SSA within the Oscillatory Synchronization Network (OSN), a drop-in replacement for the Transformer block. Analysis of the synchronization matrices reveals non-uniform, head-diverse coupling patterns even at initialization, demonstrating a stronger architectural inductive bias than the approximately uniform attention produced by randomly initialized Transformers.
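The abstract does not give the closed-form operator itself, but the two-oscillator Kuramoto locking condition it alludes to can be sketched directly: a pair with coupling $K$ and frequency gap $\Delta\omega$ phase-locks iff $|\Delta\omega| \le K$, with steady-state offset $\sin\Delta\theta = \Delta\omega / K$. The following minimal NumPy sketch (the function name `ssa_weights`, the scalar coupling `K`, and the use of the locked-pair coherence $\cos\Delta\theta$ as the attention weight are all illustrative assumptions, not the paper's actual operator) shows how this yields attention weights that are exactly zero for frequency-incompatible token pairs, without any explicit mask:

```python
import numpy as np

def ssa_weights(omega, K=1.0):
    """Illustrative sketch of synchronization-based attention weights.

    A pair (i, j) phase-locks iff |omega_i - omega_j| <= K (the classic
    Kuramoto condition for two coupled oscillators). When locked, the
    steady-state offset satisfies sin(dtheta) = domega / K, so we use the
    coherence cos(dtheta) = sqrt(1 - (domega/K)^2) as the raw weight.
    Unlocked pairs get exactly zero -- the "natural sparsity" the
    abstract describes. All names here are hypothetical.
    """
    domega = omega[:, None] - omega[None, :]            # pairwise frequency gaps
    locked = np.abs(domega) <= K                        # phase-locking mask
    coherence = np.sqrt(np.clip(1.0 - (domega / K) ** 2, 0.0, 1.0))
    W = np.where(locked, coherence, 0.0)                # zero without masking
    # Row-normalise so each token's weights form a distribution over peers.
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

# Token 2's natural frequency sits more than K away from tokens 0 and 1,
# so it attends only to itself; tokens 0 and 1 attend to each other.
omega = np.array([0.0, 0.3, 2.5])
W = ssa_weights(omega, K=1.0)
```

Note that the sparsity pattern is a hard consequence of the locking threshold, so gradients with respect to the frequencies can move a pair in or out of each other's attention support during training.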