Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve performance comparable to AR-LLMs on important tasks, such as reasoning. However, RL algorithms well suited to the unique characteristics of dLLMs have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge that arises when implementing this objective with small training batch sizes and propose several effective solutions built on a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to $54.3\%$ over previous SOTA baselines and $66.41\%$ over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
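To make the core idea concrete, the sketch below illustrates (not reproduces) the distribution-matching objective described above: samples are reweighted toward the reward-tilted target distribution $\pi^*(x) \propto \pi_{\mathrm{ref}}(x)\exp(r(x)/\beta)$ and used in a weighted cross-entropy loss, with a simple baseline subtracted from the weights. The function name `dmpo_style_loss`, the uniform $1/B$ baseline, and the use of softmax-normalized weights are illustrative assumptions, not the authors' actual implementation.

```python
import torch


def dmpo_style_loss(logp_theta, rewards, beta=1.0, use_baseline=True):
    """Hypothetical sketch of a distribution-matching loss.

    logp_theta: (B,) log-likelihoods of sampled completions under the current
                dLLM policy (e.g., an ELBO-style estimate for a diffusion LLM).
    rewards:    (B,) scalar rewards for the same completions.

    Target: the reward-tilted distribution pi*(x) ∝ pi_ref(x) * exp(r(x)/beta),
    matched via weighted cross-entropy on sampled completions.
    """
    # Self-normalized weights induced by the reward tilt.
    weights = torch.softmax(rewards / beta, dim=0)

    if use_baseline:
        # Subtract a baseline from the weights (here the uniform weight 1/B,
        # one plausible choice) to mitigate the issues that arise with small
        # training batch sizes.
        weights = weights - 1.0 / weights.numel()

    # Weighted cross-entropy: push probability mass toward high-reward samples.
    return -(weights.detach() * logp_theta).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    logp = torch.randn(4, requires_grad=True)   # stand-in policy log-likelihoods
    r = torch.tensor([1.0, 0.2, -0.5, 2.0])     # stand-in rewards
    loss = dmpo_style_loss(logp, r, beta=0.5)
    loss.backward()
    print(loss.item(), logp.grad)
```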