The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective Synchronization Attention (SSA), a novel attention mechanism that replaces the standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators. In SSA, each token is represented as an oscillator characterized by a learnable natural frequency and phase; the synchronization strength between token pairs, determined by a frequency-dependent coupling and phase-locking condition, serves as the attention weight. This formulation provides three key advantages: (i) natural sparsity arising from the phase-locking threshold, whereby tokens with incompatible frequencies automatically receive zero attention weight without explicit masking; (ii) unified positional-semantic encoding through the natural frequency spectrum, eliminating the need for separate positional encodings; and (iii) a single-pass, closed-form computation that avoids iterative ODE integration, with all components (coupling, order parameter, synchronization) derived from the oscillatory framework. We instantiate SSA within the Oscillatory Synchronization Network (OSN), a drop-in replacement for the Transformer block. Analysis of the synchronization matrices reveals non-uniform, head-diverse coupling patterns even at initialization, demonstrating a stronger architectural inductive bias than the approximately uniform attention produced by randomly initialized Transformers.
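The abstract does not give the closed-form operator itself, but the two-oscillator Kuramoto locking condition it alludes to can be sketched directly: a pair with coupling $K$ and frequency gap $\Delta\omega$ phase-locks iff $|\Delta\omega| \le K$, with steady-state offset $\sin\Delta\theta = \Delta\omega / K$. The following minimal NumPy sketch (the function name `ssa_weights`, the scalar coupling `K`, and the use of the locked-pair coherence $\cos\Delta\theta$ as the attention weight are all illustrative assumptions, not the paper's actual operator) shows how this yields attention weights that are exactly zero for frequency-incompatible token pairs, without any explicit mask:

```python
import numpy as np

def ssa_weights(omega, K=1.0):
    """Illustrative sketch of synchronization-based attention weights.

    A pair (i, j) phase-locks iff |omega_i - omega_j| <= K (the classic
    Kuramoto condition for two coupled oscillators). When locked, the
    steady-state offset satisfies sin(dtheta) = domega / K, so we use the
    coherence cos(dtheta) = sqrt(1 - (domega/K)^2) as the raw weight.
    Unlocked pairs get exactly zero -- the "natural sparsity" the
    abstract describes. All names here are hypothetical.
    """
    domega = omega[:, None] - omega[None, :]            # pairwise frequency gaps
    locked = np.abs(domega) <= K                        # phase-locking mask
    coherence = np.sqrt(np.clip(1.0 - (domega / K) ** 2, 0.0, 1.0))
    W = np.where(locked, coherence, 0.0)                # zero without masking
    # Row-normalise so each token's weights form a distribution over peers.
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

# Token 2's natural frequency sits more than K away from tokens 0 and 1,
# so it attends only to itself; tokens 0 and 1 attend to each other.
omega = np.array([0.0, 0.3, 2.5])
W = ssa_weights(omega, K=1.0)
```

Note that the sparsity pattern is a hard consequence of the locking threshold, so gradients with respect to the frequencies can move a pair in or out of each other's attention support during training.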