The success of AI assistants based on large language models (LLMs) hinges on Reinforcement Learning from Human Feedback (RLHF), which enables them to generate responses better aligned with human preferences. Because these assistants are meant to be general-purpose, they are increasingly expected to perform consistently across domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both training stability and the model's ability to generalize to unseen data. In this work, we propose a novel approach that uses RL to learn a policy that performs consistently across data groups or domains. Because group annotations are difficult to acquire, our method automatically partitions the data into groups, deliberately maximizing the variance in performance between them. We then optimize the policy to perform well on the challenging groups. Finally, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
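To make the idea of prioritizing challenging groups concrete, the following is a minimal sketch of one common way to do it: an exponentiated-gradient reweighting of per-group losses, in the style of group distributionally robust optimization. This is an assumed stand-in for illustration, not the method proposed in this work; the function names, the `step_size` parameter, and the loss values are all hypothetical.

```python
import math


def update_group_weights(weights, group_losses, step_size=1.0):
    """One exponentiated-gradient step (illustrative assumption, not the paper's update).

    Each group's weight is multiplied by exp(step_size * loss) and the result
    is renormalized, so groups with higher loss receive a larger share.
    """
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def weighted_objective(weights, group_losses):
    """Weighted sum of per-group losses that a policy update would then minimize."""
    return sum(w * loss for w, loss in zip(weights, group_losses))


# Hypothetical per-group losses: group 0 is easy, group 1 is challenging.
group_losses = [0.2, 1.0]
weights = [0.5, 0.5]

new_weights = update_group_weights(weights, group_losses)
print(new_weights)
print(weighted_objective(new_weights, group_losses))
```

After the update, the challenging group carries more weight in the objective, so a subsequent policy step is driven more by the data on which the model currently performs worst. How groups are inferred and how the exploration space is adjusted are not covered by this sketch.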