Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.