Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential on general tasks such as mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capabilities of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary-order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation challenges the premise of existing RL approaches for dLLMs, where considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, is devoted to preserving this flexibility. We demonstrate that effective reasoning is better elicited by intentionally forgoing arbitrary-order generation and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
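For context on the standard GRPO objective the abstract refers to, the following is a minimal sketch of its two core pieces: group-relative advantages (rewards standardized within a group of completions for the same prompt, with no value network) and a PPO-style clipped surrogate loss. This is an illustrative sketch only, not the paper's implementation; the function names, tensor shapes, and the `eps`/`clip_eps` hyperparameters are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages (illustrative sketch).

    rewards: (num_prompts, group_size) scalar rewards for each sampled
    completion. Each reward is standardized against the other completions
    of the same prompt, replacing a learned value baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate with group-relative advantages.

    logp_new / logp_old: log-probabilities of each sampled completion
    under the current and behavior policies, same shape as advantages.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic (clipped) objective; negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```

Because the advantage is computed purely from within-group reward statistics, this objective requires no critic, which is part of what makes a minimalist "just GRPO" recipe feasible.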