Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
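As a rough illustration of the mechanism, consider the sketch below. It is a minimal PyTorch rendering of one plausible reading of Decoupled RoPE, not the paper's exact construction: the helper names (`apply_rope`, `decoupled_query`) and the choice of sourcing the query from an already-decoded hidden state are illustrative assumptions.

```python
import torch

def apply_rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive feature pairs of x by the phases of absolute position pos."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    theta = pos * inv_freq                    # (d/2,) rotation angles
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # standard 2-D rotation per feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def standard_query(h_target: torch.Tensor, w_q: torch.Tensor, target_pos: int) -> torch.Tensor:
    # Ordinary RoPE: the query is computed at the target position itself, so in
    # any-order decoding it would have to be built from the still-unknown target
    # content; this is the gap that schemes like two-stream attention fill.
    return apply_rope(h_target @ w_q, target_pos)

def decoupled_query(h_context: torch.Tensor, w_q: torch.Tensor, target_pos: int) -> torch.Tensor:
    # Decoupled RoPE (as sketched here): the query content comes from an
    # already-decoded hidden state, but it is rotated by the *target* position's
    # phases, so attention scores carry "where we are predicting" without
    # leaking "what is there".
    return apply_rope(h_context @ w_q, target_pos)
```

Under this reading, keys are still rotated by their own positions as usual, so the query-key dot product encodes the relative offset between the target position and each context token, while the query's content never depends on the token being predicted.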