Autoregressive language models (ARMs) suffer from the reversal curse: after learning that "$A$ is $B$", they often fail on the reverse query "$B$ is $A$". Masked diffusion-based language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to the any-order training objective. However, observing "[MASK] is $B$" during training does not necessarily teach the model to handle the reverse prompt "$B$ is [MASK]". We show that the mitigation arises from architectural structure and its interaction with training. In a one-layer Transformer encoder, weight sharing couples the two directions by making forward and reverse attention scores positively correlated. In the same setting, we further show that the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on both controlled toy tasks and large-scale diffusion language models support these mechanisms, explaining why MDMs partially overcome a failure mode that persists in strong ARMs.
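The coupling between forward and reverse attention scores can be illustrated with a minimal numpy sketch (an illustrative toy, not the paper's exact construction): with shared query/key projections, the effective bilinear form $M = W_Q W_K^\top$ governs both directions, and its antisymmetric part cancels in the sum of the two scores, so only the symmetric part of the shared weights contributes jointly to "$A$ attends to $B$" and "$B$ attends to $A$".

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Shared query/key projections, as in a weight-tied one-layer encoder
# (illustrative dimensions; not the paper's actual model)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
M = W_Q @ W_K.T  # effective bilinear form for unnormalized attention scores

e_A = rng.normal(size=d)  # embedding of token A
e_B = rng.normal(size=d)  # embedding of token B

# Forward score (A attends to B) and reverse score (B attends to A)
forward = e_A @ M @ e_B
reverse = e_B @ M @ e_A  # equals e_A @ M.T @ e_B

# Decompose M into symmetric and antisymmetric parts:
# the antisymmetric part cancels in forward + reverse, so the
# symmetric component of the shared weights couples both directions.
M_sym = 0.5 * (M + M.T)
print(np.isclose(forward + reverse, 2 * e_A @ M_sym @ e_B))  # True
```

This identity holds exactly for any shared $W_Q, W_K$; the degree to which the two scores are *positively correlated* in practice depends on how training shapes the symmetric part of $M$, which is the effect the abstract attributes to the interaction of architecture and training.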