Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work reveals a critical phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten later in the denoising process. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
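A minimal sketch of the two ideas, assuming the answer decoded at each denoising step is available as a string. Exact string match here stands in for the semantic equivalence the paper's TSE presumably uses, and the uniform step weighting is an illustrative assumption, not the paper's method:

```python
from collections import Counter
import math

def temporal_self_consistency_vote(intermediate_answers, weights=None):
    """Pick the answer most consistent across denoising steps.

    intermediate_answers: answer decoded at each denoising step
    weights: optional per-step weights (e.g., to favor later steps);
             uniform weighting by default (an illustrative choice)
    """
    if weights is None:
        weights = [1.0] * len(intermediate_answers)
    scores = Counter()
    for ans, w in zip(intermediate_answers, weights):
        scores[ans] += w
    return scores.most_common(1)[0][0]

def temporal_semantic_entropy(intermediate_answers):
    """Entropy of the empirical distribution over answer clusters.

    Exact match stands in for semantic clustering here. Lower entropy
    means a more stable trajectory, so -TSE serves as the reward.
    """
    counts = Counter(intermediate_answers)
    n = len(intermediate_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy temporal oscillation: the correct answer "42" emerges mid-trajectory
# but is overwritten at the final step; voting recovers it.
trajectory = ["7", "42", "42", "42", "41"]
print(temporal_self_consistency_vote(trajectory))  # -> "42"
print(-temporal_semantic_entropy(trajectory))      # reward: higher when stable
```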