Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction

Efforts to align Large Language Models (LLMs) are mainly conducted via Reinforcement Learning from Human Feedback (RLHF) methods. However, RLHF faces major challenges, including training reward models, actor-critic engineering, and, importantly, the need for access to LLM parameters. Here we introduce Aligner, a new efficient alignment paradigm that bypasses the whole RLHF process by learning the correctional residuals between the aligned and the unaligned answers. Aligner offers several key advantages. First, it is an autoregressive seq2seq model trained on a query-answer-correction dataset via supervised learning, which offers a parameter-efficient alignment solution with minimal resources. Second, Aligner facilitates weak-to-strong generalization: fine-tuning large pretrained models with Aligner's supervisory signals yields a strong performance boost. Third, Aligner functions as a model-agnostic plug-and-play module that can be applied directly to different open-source and API-based models. Remarkably, Aligner-7B improves 11 different LLMs by 21.9% in helpfulness and 23.8% in harmlessness on average (GPT-4 by 17.5% and 26.9%). When fine-tuning (strong) Llama2-70B with (weak) Aligner-13B's supervision, we can improve Llama2 by 8.2% in helpfulness and 61.6% in harmlessness. See our dataset and code at https://aligner2024.github.io
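
The plug-and-play setup described in the abstract can be illustrated with a minimal sketch: an upstream model drafts an answer, and a separate corrector maps the (query, answer) pair to a corrected answer, so the upstream model's parameters are never touched. Everything named here (`format_input`, `make_training_pair`, `answer_with_correction`, the prompt template, and the toy stand-in models) is a hypothetical illustration, not the authors' code or API.

```python
from typing import Callable, Tuple

# Hypothetical type aliases for illustration only.
UpstreamModel = Callable[[str], str]     # query -> draft answer
Corrector = Callable[[str, str], str]    # (query, draft answer) -> corrected answer


def format_input(query: str, answer: str) -> str:
    """Join a query and a draft answer into one seq2seq input string.

    The template is an assumption; the abstract does not specify one.
    """
    return f"Query: {query}\nAnswer: {answer}\nCorrection:"


def make_training_pair(query: str, answer: str, correction: str) -> Tuple[str, str]:
    """Turn one query-answer-correction triple into a supervised (input, target) pair."""
    return format_input(query, answer), correction


def answer_with_correction(query: str, upstream: UpstreamModel, corrector: Corrector) -> str:
    """Draft with the upstream model, then let the corrector revise the draft.

    The upstream model is used only through its text output, so it could
    equally be an open-source model or a remote API.
    """
    draft = upstream(query)
    return corrector(query, draft)


# Toy stand-ins so the sketch runs without any model weights.
def toy_upstream(query: str) -> str:
    return "draft: " + query.lower()


def toy_corrector(query: str, draft: str) -> str:
    return draft.replace("draft: ", "corrected: ")


if __name__ == "__main__":
    print(answer_with_correction("Hello", toy_upstream, toy_corrector))
```

In a real setting the corrector would be a trained seq2seq model and the training pairs would come from the query-answer-correction dataset; the toy functions only show how the pieces compose.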

