The reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs). However, collecting human feedback for RLHF is resource-intensive and raises scalability issues for LLMs and complex tasks. Our proposed framework, Proto-RM, leverages prototypical networks to enhance reward models under limited human feedback. By enabling stable and reliable structured learning from fewer samples, Proto-RM significantly improves LLMs' adaptability and accuracy in interpreting human preferences. Extensive experiments on various datasets demonstrate that Proto-RM improves the performance of reward models and LLMs in human feedback tasks, achieving comparable, and often better, results than traditional methods in data-limited scenarios while requiring substantially less data. This research offers a promising direction for improving the data efficiency of reward models and for optimizing the fine-tuning of language models under restricted feedback conditions.
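To make the idea concrete, the sketch below shows one plausible way a prototypical-network reward head could be structured in PyTorch: sample embeddings are pooled against a small set of learnable prototypes, and the resulting representation is scored as a scalar reward trained with the standard pairwise preference loss. This is a minimal illustration under assumed design choices; the class name `PrototypicalRewardHead`, the number of prototypes, and the distance-weighted pooling are hypothetical and not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalRewardHead(nn.Module):
    """Illustrative sketch (not Proto-RM's actual code): score an encoded
    response by its similarity to a small set of learnable prototypes."""

    def __init__(self, hidden_dim: int, num_prototypes: int = 8):
        super().__init__()
        # Prototypes act as anchors summarizing the preference structure
        # learned from a limited pool of human-feedback samples.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, hidden_dim))
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) pooled embedding of a prompt-response pair.
        # Distance-based weighting over prototypes, prototypical-network style:
        dists = torch.cdist(h, self.prototypes)       # (batch, num_prototypes)
        weights = F.softmax(-dists, dim=-1)           # closer prototype -> higher weight
        h_proto = weights @ self.prototypes           # prototype-refined representation
        return self.scorer(h_proto).squeeze(-1)       # scalar reward per sample

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Standard RLHF pairwise reward-model objective (Bradley-Terry style):
    # maximize the margin between chosen and rejected responses.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The intuition behind this kind of design is that the prototypes compress the few available feedback samples into a stable, reusable structure, so new responses are scored relative to learned preference anchors rather than memorized examples.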