The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training includes a regularization term: the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to approximate it from on-policy samples. Despite the wide adoption of these estimators, including in several open-source libraries, there is no systematic study of the numerous ways of incorporating them into the objective and of their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for the stated objective, creating a discrepancy between the objective and its implementation. In this paper, we analyze these practices further and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings empirically by RL fine-tuning \texttt{Qwen2.5-7B}, \texttt{Llama-3.1-8B-Instruct}, and \texttt{Qwen3-4B-Instruct-2507} with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Our analysis shows that, in on-policy settings: (1) estimator configurations with biased gradients can cause training instabilities; and (2) configurations that yield unbiased gradients lead to better performance on both in-domain and out-of-domain tasks. We also investigate how different KL configurations perform in off-policy settings and observe that KL regularization can help stabilize the off-policy RL training that arises in asynchronous setups.
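For concreteness, the regularized objective and the per-sample KL estimators commonly used in practice can be sketched as follows; the notation is illustrative (a minimal sketch assuming a reward $r(x, y)$, a regularization coefficient $\beta$, and the per-sample ratio $\rho = \pi_{\mathrm{ref}}(y \mid x) / \pi_{\theta}(y \mid x)$) and is not necessarily the exact formulation analyzed in the paper:
\[
\mathcal{J}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\bigl[\, r(x, y) \,\bigr] \;-\; \beta\, \mathrm{KL}\!\bigl(\pi_{\theta}(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr),
\]
\[
k_1 = -\log \rho, \qquad k_2 = \tfrac{1}{2}\,(\log \rho)^2, \qquad k_3 = \rho - 1 - \log \rho,
\]
where each $k_i$ is evaluated on on-policy samples $y \sim \pi_{\theta}$ as a Monte Carlo approximation of the reverse KL term; how such an estimate is then incorporated into the loss (e.g., folded into the reward or added as a differentiable penalty) determines the gradient that the optimizer actually follows.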