While Reinforcement Learning (RL) has proven essential for fine-tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, which requires computationally expensive hyperparameter tuning. Moreover, KL regularization acts only on the language policy and neglects another potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we introduce Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the reward of the demonstration and the reward of the LLM's generation, rather than directly maximizing the reward function. This shift in objective removes the incentive for the LLM to exploit the reward model and promotes more natural and diverse language generation. We demonstrate the effectiveness of RCfD on three language tasks, where it achieves performance comparable to carefully tuned baselines while mitigating ROO.
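The objective described above can be illustrated with a minimal sketch. The abstract only states that a distance between the two rewards is minimized, so the absolute difference used here is an assumption, as are the names `rcfd_reward` and `toy_reward_model`, which are hypothetical and chosen purely for illustration. The toy reward model is deliberately exploitable (it rewards longer responses) to show how the calibrated reward stops favoring ever-higher raw scores.

```python
from typing import Callable

# A reward model maps (prompt, response) to a scalar score.
RewardModel = Callable[[str, str], float]


def rcfd_reward(
    reward_model: RewardModel,
    prompt: str,
    generation: str,
    demonstration: str,
) -> float:
    """Calibrated reward: negative distance to the demonstration's reward.

    Assumption: the distance is an absolute difference. The maximum (0.0)
    is reached when the generation scores exactly like the demonstration,
    not when it scores as high as possible.
    """
    r_generation = reward_model(prompt, generation)
    r_demonstration = reward_model(prompt, demonstration)
    return -abs(r_generation - r_demonstration)


def toy_reward_model(prompt: str, response: str) -> float:
    """Hypothetical, exploitable reward model: longer responses score higher."""
    return float(len(response.split()))


if __name__ == "__main__":
    prompt = "Summarize the article."
    demonstration = "The study finds mild effects."           # 5 words
    matched = "The paper reports small effects."              # 5 words
    exploiting = "very " * 20 + "long answer"                 # 22 words

    for name, text in [("matched", matched), ("exploiting", exploiting)]:
        raw = toy_reward_model(prompt, text)
        calibrated = rcfd_reward(toy_reward_model, prompt, text, demonstration)
        print(f"{name}: raw={raw}, calibrated={calibrated}")
```

Under the raw reward, the padded response wins. Under the calibrated reward, it is penalized for drifting away from the demonstration's score, which is the behavior the objective shift is meant to encourage. In an actual RL loop this scalar would replace the raw reward passed to the policy optimizer.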