Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks. However, the open-ended nature of these tasks means that there is always room to improve the quality of model responses. Various approaches have been proposed to address this challenge, and there has been a growing focus on enabling LLMs to self-improve their response quality, which reduces the reliance on extensive human annotation for collecting diverse, high-quality training data. Among self-improvement methods, prompting-based methods have recently been widely explored owing to their effectiveness, efficiency, and convenience. However, these methods usually require explicit and thoroughly written rubrics as inputs to LLMs. Manually deriving and providing all the necessary rubrics is expensive and challenging when the improvement goal is complex and drawn from the real world (e.g., being more helpful and less harmful). To this end, we propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data. PIT requires only the preference data that are used to train reward models, with no extra human effort. Specifically, we reformulate the training objective of reinforcement learning from human feedback (RLHF): instead of maximizing response quality for a given input, we maximize the quality gap of the response conditioned on a reference response. In this way, PIT is implicitly trained with the improvement goal of better aligning with human preferences. Experiments on two real-world datasets and one synthetic dataset show that our method significantly outperforms prompting-based methods.
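The reformulated objective can be illustrated with a minimal sketch. This is one possible reading of the "quality gap" idea, not the PIT implementation. It assumes the gap can be written as a difference of scalar scores, and `toy_quality` is a hypothetical stand-in for a learned reward model. The Bradley-Terry style `preference_loss` is also an assumption, included only to show how preference pairs could supervise a gap-valued score.

```python
import math
from typing import Callable

# A scorer maps (prompt, response) to a scalar quality score.
Scorer = Callable[[str, str], float]


def toy_quality(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: counts distinct words."""
    return float(len(set(response.lower().split())))


def standard_reward(score: Scorer, prompt: str, response: str) -> float:
    """Usual RLHF signal: quality of the response for the given input."""
    return score(prompt, response)


def gap_reward(score: Scorer, prompt: str, response: str, reference: str) -> float:
    """Gap-style signal (assumed form): quality of the response minus
    the quality of the reference response it is meant to improve on."""
    return score(prompt, response) - score(prompt, reference)


def preference_loss(gap: float) -> float:
    """Bradley-Terry style loss -log(sigmoid(gap)), computed stably.

    Here `gap` is the score gap of the preferred response over the
    dispreferred one; a larger gap gives a smaller loss.
    """
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


if __name__ == "__main__":
    prompt = "How do I stay safe while hiking?"
    reference = "Bring water."
    candidate = "Bring water, tell someone your route, and check the weather."
    print("standard:", standard_reward(toy_quality, prompt, candidate))
    print("gap:", gap_reward(toy_quality, prompt, candidate, reference))
    print("loss:", preference_loss(gap_reward(toy_quality, prompt, candidate, reference)))
```

Under this reading, a policy that merely copies the reference response receives a gap of zero, so the training signal rewards improvement over the reference rather than absolute quality.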