Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT, Anthropic's Claude, and Meta's LLaMA-2. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction-following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between training and test data becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a trade-off in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the trade-off between generalisation and diversity.