Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences. However, standard RLHF typically optimizes against a single overall reward, which can lead to a suboptimal learning process, because RLHF is unaware of which specific tokens should be reinforced or suppressed. Moreover, conflicting supervision can arise, for instance when a chosen response includes erroneous tokens while a rejected response contains accurate elements. To rectify these shortcomings, a growing number of dense reward methods, such as step-wise and token-wise RLHF, have been proposed; however, these existing methods are limited to specific tasks (such as mathematics). In this paper, we propose the ``Adaptive Message-wise RLHF'' method, which applies robustly across a variety of tasks. By defining pivot tokens as key indicators, our approach adaptively identifies essential information and converts sequence-level supervision into fine-grained, subsequence-level supervision, aligning the density of rewards and action spaces more closely with the information density of the input. Experiments demonstrate that our method can be integrated into various training methods, significantly mitigating hallucination and catastrophic forgetting while outperforming other methods on multiple evaluation metrics. Our method improves the success rate on adversarial samples by 10\% compared to the sample-wise approach, and achieves a 1.3\% improvement on evaluation benchmarks such as MMLU, GSM8K, and HumanEval.
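To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of converting a sequence-level training signal into subsequence-level supervision: tokens flagged as pivots receive the loss, while the rest are masked out, so the reward density tracks the information density of the response. The helper names `find_pivot_mask` and `masked_token_loss` are illustrative assumptions.

```python
def find_pivot_mask(tokens, pivot_tokens):
    """Mark each token that matches a pivot (key indicator).

    In the paper's framing, pivot tokens adaptively identify the essential
    subsequences; here we use a simple membership test as a stand-in.
    """
    return [1.0 if t in pivot_tokens else 0.0 for t in tokens]

def masked_token_loss(per_token_loss, mask):
    """Average the per-token loss over pivot positions only.

    Non-pivot tokens are masked out, turning a sequence-level objective
    into fine-grained, subsequence-level supervision.
    """
    weighted = [loss * m for loss, m in zip(per_token_loss, mask)]
    active = sum(mask)
    return sum(weighted) / active if active else 0.0

# Toy example: only "answer" and "42" are treated as pivot tokens,
# so only their losses contribute to the training signal.
tokens = ["The", "answer", "is", "42", "."]
per_token_loss = [0.1, 0.2, 0.1, 0.9, 0.05]
mask = find_pivot_mask(tokens, {"answer", "42"})
print(masked_token_loss(per_token_loss, mask))
```

In an actual RLHF pipeline the mask would multiply the token-level advantage or preference loss inside the optimizer, but the masking principle is the same.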