The success of AI assistants based on large language models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF), which steers generated responses toward human preferences. Because these assistants are meant to be general-purpose, they are increasingly expected to perform consistently across domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards while overlooking challenging samples. This focus on quick reward gains undermines both training stability and the model's ability to generalize to new, unseen data. In this work, we propose a novel approach that learns a consistent policy via RL across data groups or domains. Because group annotations are difficult to acquire, our method automatically classifies data into groups so as to maximize the variance in performance across them. We then optimize the policy to perform well on the challenging groups. Finally, using the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
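The three steps named in the abstract (automatic grouping, emphasis on challenging groups, adaptive exploration) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration rather than the authors' algorithm: the two-way split by a reward threshold, the softmax weighting over negative mean group reward, the use of a per-sample KL penalty coefficient as a stand-in for "exploration space", and the hypothetical names `split_by_reward_gap`, `group_weights`, `reweight`, `base_beta`, and `temperature`.

```python
import numpy as np


def split_by_reward_gap(rewards):
    """Assumed grouping rule: split samples into two groups (0 = hard, 1 = easy)
    at the reward threshold that maximizes between-group variance."""
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(rewards)
    sorted_r = rewards[order]
    n = len(sorted_r)
    best_k, best_score = 1, -np.inf
    for k in range(1, n):  # the k lowest-reward samples form the hard group
        lo, hi = sorted_r[:k], sorted_r[k:]
        score = (k / n) * ((n - k) / n) * (hi.mean() - lo.mean()) ** 2
        if score > best_score:
            best_k, best_score = k, score
    groups = np.ones(n, dtype=int)
    groups[order[:best_k]] = 0
    return groups


def group_weights(rewards, groups, temperature=1.0):
    """Assumed weighting rule: softmax over negative mean group reward,
    so lower-reward (harder) groups receive more weight."""
    rewards = np.asarray(rewards, dtype=float)
    ids = np.unique(groups)
    means = np.array([rewards[groups == g].mean() for g in ids])
    logits = -means / temperature
    logits = logits - logits.max()  # numerical stability
    w = np.exp(logits)
    return ids, w / w.sum()


def reweight(rewards, base_beta=0.1, temperature=1.0):
    """Return group labels, per-sample loss weights, and per-sample KL coefficients."""
    rewards = np.asarray(rewards, dtype=float)
    groups = split_by_reward_gap(rewards)
    ids, w = group_weights(rewards, groups, temperature)
    idx = np.searchsorted(ids, groups)  # map each sample to its group's weight
    sample_weight = w[idx]
    # Assumption: a smaller KL penalty on hard groups permits more exploration,
    # while a larger one on easy groups discourages over-optimization.
    kl_coef = base_beta * (1.0 - w[idx])
    return groups, sample_weight, kl_coef


if __name__ == "__main__":
    groups, sample_weight, kl_coef = reweight([0.1, 0.2, 0.9, 1.0, 0.15])
    print(groups, sample_weight, kl_coef)
```

In an RL loop, `sample_weight` would scale each sample's policy loss and `kl_coef` would scale its KL penalty against the reference policy. Both uses are part of the same illustrative assumption, and the sketch expects at least two samples.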