Efforts to align Large Language Models (LLMs) are mainly conducted via Reinforcement Learning from Human Feedback (RLHF) methods. However, RLHF faces major challenges: it involves training reward models and actor-critic engineering and, importantly, it requires access to LLM parameters. Here we introduce Aligner, a new, efficient alignment paradigm that bypasses the whole RLHF process by learning the correctional residuals between aligned and unaligned answers. Aligner offers several key advantages. First, it is an autoregressive seq2seq model trained on a query-answer-correction dataset via supervised learning, which offers a parameter-efficient alignment solution that requires minimal resources. Second, Aligner facilitates weak-to-strong generalization: finetuning large pretrained models on Aligner's supervisory signals yields a strong performance boost. Third, Aligner functions as a model-agnostic plug-and-play module that can be applied directly to different open-source and API-based models. Remarkably, Aligner-7B improves 11 different LLMs by 21.9% in helpfulness and 23.8% in harmlessness on average (GPT-4 by 17.5% and 26.9%, respectively). When finetuning the (strong) Llama2-70B with the (weak) Aligner-13B's supervision, we improve Llama2 by 8.2% in helpfulness and 61.6% in harmlessness. See our dataset and code at https://aligner2024.github.io
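To make the plug-and-play idea concrete, the following is a minimal sketch of how a correction model could wrap an arbitrary upstream model, and how a query-answer-correction triple could be turned into a supervised seq2seq example. It is an illustration under stated assumptions, not the released implementation: the prompt template, the function names (`build_correction_prompt`, `aligned_generate`, `make_training_example`), and the `source`/`target` field names are all hypothetical, and both models are represented as plain text-to-text callables so that open-source and API-based models look the same to the wrapper.

```python
from typing import Callable, Dict

# A text-in, text-out model. This could wrap a local checkpoint or a remote API;
# the wrapper below never touches model parameters.
TextModel = Callable[[str], str]

# Hypothetical prompt template (an assumption for illustration only).
CORRECTION_TEMPLATE = (
    "Edit the following question-answer pair to make it more helpful and harmless.\n"
    "Question: {query}\n"
    "Answer: {answer}\n"
    "Corrected answer:"
)


def build_correction_prompt(query: str, answer: str) -> str:
    """Pack the query and the upstream answer into one input for the correction model."""
    return CORRECTION_TEMPLATE.format(query=query, answer=answer)


def aligned_generate(query: str, upstream: TextModel, aligner: TextModel) -> str:
    """Generate a draft with the upstream model, then let the aligner rewrite it."""
    draft = upstream(query)
    return aligner(build_correction_prompt(query, draft))


def make_training_example(query: str, answer: str, correction: str) -> Dict[str, str]:
    """Turn one (query, answer, correction) triple into a supervised seq2seq pair."""
    return {
        "source": build_correction_prompt(query, answer),
        "target": correction,
    }


if __name__ == "__main__":
    # Stub models standing in for a real upstream LLM and a trained correction model.
    def stub_upstream(query: str) -> str:
        return "A rough first answer."

    def stub_aligner(prompt: str) -> str:
        return "A revised, more careful answer."

    print(aligned_generate("How do I back up my files?", stub_upstream, stub_aligner))
```

Because the wrapper only needs the upstream model's text output, swapping in a different upstream model means passing a different callable; the correction model and its prompt stay unchanged.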