During the last stage of RLHF, a large language model is aligned to human intent via PPO training, a process that generally requires large-scale computational resources. In this technical report, we empirically investigate an efficient implementation of RLHF using low-rank adaptation (LoRA), which allows us to align the LLaMA 7B checkpoint on the Alpaca dataset using only two A100 GPUs instead of the eight required for full model fine-tuning. Despite tuning only 0.2% of LLaMA 7B's parameters, our implementation achieves better performance than the publicly released AlpacaFarm checkpoint, which was trained with full model fine-tuning. Next, we analyze several configurations of our LoRA-based PPO implementation, varying the form of the KL regularization term in the training objective. We find that (1) removing this penalty term does not harm performance on the AlpacaFarm evaluation set under our LoRA setup; (2) other regularizers, such as Jensen-Shannon divergence, lead to improved performance; and (3) while PPO training negatively impacts the factuality of model-generated responses, training with LoRA largely mitigates this effect. We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
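To make the regularizer comparison concrete, the following is a minimal sketch of the three variants named above (KL penalty, Jensen-Shannon penalty, no penalty), computed on small categorical distributions. It is an illustration under stated assumptions, not the report's implementation: the function names, the `beta` coefficient, and the choice to subtract the penalty from a scalar reward are all hypothetical, and a real PPO pipeline would typically estimate the penalty from sampled token log-probabilities rather than from full distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def regularized_reward(reward, policy_probs, ref_probs, beta=0.1, kind="kl"):
    """Subtract a divergence penalty from the reward (hypothetical helper).

    kind="kl"   penalizes KL(policy || reference)
    kind="js"   penalizes the Jensen-Shannon divergence instead
    kind="none" drops the penalty entirely
    """
    if kind == "none":
        return reward
    if kind == "kl":
        penalty = kl_divergence(policy_probs, ref_probs)
    elif kind == "js":
        penalty = js_divergence(policy_probs, ref_probs)
    else:
        raise ValueError(f"unknown regularizer: {kind}")
    return reward - beta * penalty


if __name__ == "__main__":
    policy = [0.7, 0.2, 0.1]     # illustrative next-token distribution
    reference = [0.4, 0.4, 0.2]  # illustrative reference-model distribution
    for kind in ("kl", "js", "none"):
        print(kind, regularized_reward(1.0, policy, reference, kind=kind))
```

One property the sketch makes visible: the Jensen-Shannon term is symmetric and never exceeds log 2, whereas the KL term grows without bound when the policy assigns mass to outputs the reference model considers very unlikely.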