Positional encodings are essential to transformer-based generative models, yet their behavior in multimodal and attention-sharing settings is not fully understood. In this work, we present a principled analysis of Rotary Positional Embeddings (RoPE), showing that RoPE naturally decomposes into frequency components with distinct positional sensitivities. We demonstrate that this frequency structure explains why shared-attention mechanisms, where a target image is generated while attending to tokens from a reference image, can lead to reference copying, in which the model reproduces content from the reference instead of extracting only its stylistic cues. Our analysis reveals that the high-frequency components of RoPE dominate the attention computation, forcing queries to attend mainly to spatially aligned reference tokens and thereby inducing this unintended copying behavior. Building on these insights, we introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment. Applied to modern transformer-based diffusion architectures, where all tokens share attention, this modulation restores stable and meaningful shared attention. As a result, it enables effective control over the degree of style transfer versus content copying, yielding a proper style-aligned generation process in which stylistic attributes are transferred without duplicating reference content.
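The frequency decomposition described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's implementation. It assumes the standard RoPE frequency schedule with base 10000 and an interleaved channel pairing. The `modulate` helper and its `cutoff` and `scale` parameters are hypothetical stand-ins for the band modulation the abstract mentions. The sketch shows that the attention logit splits into one term per frequency band, and that for identical content at a positional offset each band keeps a fraction `cos(offset * freq)` of its aligned value, so high-frequency bands penalize misalignment most strongly.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per channel pair (ordered high to low)."""
    i = np.arange(head_dim // 2)
    return base ** (-2.0 * i / head_dim)

def rotate(x, pos, freqs):
    """Apply RoPE to a real vector x at position pos; returns complex channel pairs."""
    pairs = x[0::2] + 1j * x[1::2]  # assumption: interleaved pairing convention
    return pairs * np.exp(1j * pos * freqs)

def band_logits(q, k, q_pos, k_pos, freqs):
    """Per-band contribution to the attention logit; the sum is the full logit."""
    return np.real(rotate(q, q_pos, freqs) * np.conj(rotate(k, k_pos, freqs)))

def modulate(freqs, cutoff, scale):
    """Hypothetical band modulation: scale every frequency above `cutoff` by `scale`."""
    return np.where(freqs > cutoff, freqs * scale, freqs)

head_dim = 64
freqs = rope_frequencies(head_dim)
rng = np.random.default_rng(0)
x = rng.standard_normal(head_dim)  # same content used as query and key

# Identical content, first spatially aligned, then offset by 3 positions.
aligned = band_logits(x, x, 10, 10, freqs)
shifted = band_logits(x, x, 10, 13, freqs)

# Suppressing the high-frequency bands removes their positional penalty.
flat_freqs = modulate(freqs, cutoff=0.1, scale=0.0)
shifted_mod = band_logits(x, x, 10, 13, flat_freqs)

print("full logit, aligned:           ", aligned.sum())
print("full logit, offset 3:          ", shifted.sum())
print("full logit, offset 3, modulated:", shifted_mod.sum())
```

With `scale=0.0` the selected bands stop encoding position at all, so their contribution depends only on content similarity. Intermediate values of `scale` would trade positional alignment against semantic matching, which is one plausible way to read the control over style transfer versus content copying described above.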