Autoregressive (AR) language models enforce a fixed left-to-right generation order, which becomes a fundamental limitation when the required output structure conflicts with the natural order of reasoning (e.g., when presentation or schema constraints require answers before explanations). In such cases, AR models must commit to an answer before generating any intermediate reasoning. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps relative to standard chain-of-thought ordering (up to a 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness because simpler tokens (e.g., reasoning steps) stabilize earlier in the diffusion process than complex ones (e.g., final answers), allowing the reasoning to settle before the answer is committed. Finally, we identify failure conditions under which this advantage weakens, delineating the conditions required for order robustness.