Kuramoto Attention: Synchronizing Self-Attention on the Torus

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix $A$ acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character (BPC) stays close to that of a strong matched RoPE+SwiGLU transformer. At one million parameters it is about $0.02$ BPC behind ($1.637\pm0.010$ versus $1.616\pm0.004$). At five million parameters, over five seeds, the two are level on the median ($1.448$ versus $1.452$), while the transformer is ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, together with its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
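The update described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the per-coordinate `gate` vector, the step size `step`, summing the score over coordinates, letting each token attend to itself as well as to earlier tokens, and the absence of heads, learned projections, and rotary phase drift are all assumptions made for illustration. The helper `circular_mean_tangent` checks the stated equivalence numerically: the tangent component at $\theta_t$ of the attention-weighted circular mean equals $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$.

```python
import numpy as np

def kuramoto_attention_step(theta, gate, step=0.5):
    """One causal Kuramoto-style attention update (illustrative sketch).

    theta: (T, d) array of phases in radians, one row per token.
    gate:  (d,) array of per-coordinate gates (hypothetical parameterization).
    step:  step size for the phase update (hypothetical).
    """
    T, _ = theta.shape
    # diff[t, u, k] = theta[u, k] - theta[t, k]
    diff = theta[None, :, :] - theta[:, None, :]
    # Gated cosine similarity, summed over coordinates (assumed score form).
    scores = np.einsum("tuk,k->tu", np.cos(diff), gate)
    # Causal mask: token t attends to tokens u <= t (assumption: self included).
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax; the diagonal is always finite, so each row max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Kuramoto coupling term: sum_u A[t, u] * sin(theta_u - theta_t), per coordinate.
    coupling = np.einsum("tu,tuk->tk", weights, np.sin(diff))
    # Move each phase along the coupling direction and wrap back onto the circle.
    new_theta = np.mod(theta + step * coupling, 2.0 * np.pi)
    return new_theta, weights, coupling

def circular_mean_tangent(theta, weights):
    """Tangent component at theta_t of the attention-weighted circular mean."""
    # Weighted mean of unit phasors exp(i * theta_u), one per token and coordinate.
    mean = np.einsum("tu,uk->tk", weights, np.exp(1j * theta))
    # Rotating by -theta_t puts the tangent direction on the imaginary axis.
    return np.imag(mean * np.exp(-1j * theta))
```

Calling `kuramoto_attention_step` on random phases and comparing its `coupling` output with `circular_mean_tangent(theta, weights)` using `np.allclose` confirms the two forms agree up to floating-point error.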
