The map from neural network parameters to functions is inherently non-injective: distinct parameter configurations can realize the same function, a phenomenon known as functional equivalence. While this symmetry is well understood in classical fully connected and convolutional models, it becomes substantially more intricate in modern attention-based architectures. Existing analyses of multihead attention have largely focused on the vanilla formulation, overlooking positional encodings, which fundamentally reshape architectural symmetries. In this work, we provide a formal study of functional equivalence in Transformers with positional encodings. Focusing on the two most widely used variants, sinusoidal and rotary positional encodings (RoPE), we show that sinusoidal encodings preserve the equivalence structure of vanilla attention, whereas rotary encodings significantly reduce the symmetry group, thereby enhancing expressivity. This offers a principled explanation for the growing prominence of RoPE in practice. We further examine how positional encodings affect linear mode connectivity and, using an alignment algorithm, empirically demonstrate that the presence and variability of connectivity across Transformer settings depend crucially on the positional encoding.
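To make the symmetry claim concrete, the following is a minimal sketch of a toy check, not the construction used in this work. It assumes a single attention head of width 2, a single rotary frequency, and row-vector inputs; all names (`reparametrize`, `logit`, `A_generic`, and so on) and numeric values are hypothetical. It reparametrizes the query and key weights as W_Q A and W_K A^{-T} for an invertible A, then compares the pre-softmax attention logit before and after, with and without a rotary encoding.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rotation(angle):
    """2x2 rotation matrix; the toy rotary encoding at one frequency."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def reparametrize(W_q, W_k, A):
    """Return (W_q A, W_k A^{-T}), a candidate functionally equivalent pair."""
    return matmul(W_q, A), matmul(W_k, transpose(inverse_2x2(A)))

def logit(x_m, x_n, W_q, W_k, m, n, theta=None):
    """Pre-softmax attention logit between positions m and n.

    x_m and x_n are 1 x d_model row vectors. theta=None means no rotary
    encoding; otherwise queries and keys are rotated by position * theta.
    """
    q = matmul(x_m, W_q)
    k = matmul(x_n, W_k)
    if theta is not None:
        q = matmul(q, rotation(m * theta))
        k = matmul(k, rotation(n * theta))
    return matmul(q, transpose(k))[0][0]

# Hypothetical toy values (d_model = 2, head width = 2).
W_q = [[1.0, 0.5], [-0.3, 0.8]]
W_k = [[0.2, -0.7], [0.9, 0.4]]
X_M, X_N = [[1.0, 2.0]], [[-0.5, 1.5]]
M, N, THETA = 3, 1, 0.5

A_generic = [[2.0, 1.0], [0.0, 0.5]]   # invertible, does not commute with rotations
A_rotation = rotation(0.7)             # commutes with every 2x2 rotation

for name, A in [("generic", A_generic), ("rotation", A_rotation)]:
    W_q2, W_k2 = reparametrize(W_q, W_k, A)
    for label, theta in [("vanilla", None), ("rope", THETA)]:
        before = logit(X_M, X_N, W_q, W_k, M, N, theta)
        after = logit(X_M, X_N, W_q2, W_k2, M, N, theta)
        print(f"{name:8s} {label:7s} changed={abs(before - after) > 1e-9}")
```

In this toy setting, the vanilla logit is unchanged for any invertible A because A and its inverse cancel between query and key. With the rotary encoding, a rotation by the relative position sits between them, so the logit is preserved only when A commutes with that rotation. Additive sinusoidal encodings modify only the inputs, so the vanilla check applies to them unchanged. Practical RoPE uses several frequencies across two-dimensional blocks, which this single-frequency sketch does not model.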