A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective and must therefore be approximated at each training step. Existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, but the forward computational graphs of all MC samples must be retained to compute gradients of the non-linear terms in the RL objective, incurring significant memory overhead. This constraint restricts the feasible sample size, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated as a linear sum in which each term depends on only a single MC sample, enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: both the value and the gradient of this lower bound equal those of the ELBO-based objective in on-policy training, making it an effective approximation of the original RL objective as well. These properties allow BGPO to use a large MC sample size, yielding more accurate likelihood approximations and a better estimate of the RL objective, which in turn leads to improved performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs on math problem solving, code generation, and planning tasks. Our code and models are available at \href{https://github.com/THU-KEG/BGPO}{https://github.com/THU-KEG/BGPO}.
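To make the Linearity property concrete, the sketch below illustrates (in a minimal, hypothetical form, not BGPO's actual bound) the gradient-accumulation pattern it enables: when the objective is a plain sum of per-sample terms, each MC sample's backward pass can run immediately and its computational graph can be freed, so memory stays constant in the number of MC samples. The toy policy, tensor shapes, and scalar advantage are illustrative placeholders.
\begin{verbatim}
import torch
import torch.nn as nn

# Illustrative setup: a toy "policy" whose per-sample outputs stand in for
# the ELBO's per-sample log-likelihood terms; names and shapes are made up.
torch.manual_seed(0)
policy = nn.Linear(16, 1)
mc_inputs = torch.randn(64, 16)   # 64 MC samples (e.g., masked/noised inputs)
advantage = 0.7                   # scalar advantage of the sampled response
n_mc = mc_inputs.shape[0]

# Linear objective: an average of per-sample terms, so each term can be
# backpropagated on its own and its graph freed right away.
policy.zero_grad()
for x in mc_inputs:
    term = advantage * policy(x).squeeze() / n_mc
    (-term).backward()            # accumulate grads; memory is constant in n_mc

# By contrast, a non-linear function of the MC average (e.g., a clipped
# importance ratio applied to the averaged log-likelihood) would require
# keeping all 64 forward graphs alive until a single backward() call.
\end{verbatim}
Under a non-linear (e.g., clipped off-policy) objective this per-sample accumulation would no longer be exact, which is consistent with the Equivalence property above being stated for on-policy training.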