SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF

Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in the RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that lets end users control responses during inference. SteerLM conditions responses to conform to an explicitly defined, multi-dimensional set of attributes, thereby enabling a steerable AI that generates helpful, high-quality responses while remaining customizable. Experiments show that SteerLM, trained on open-source datasets, generates responses that human and automatic evaluators prefer over those of many state-of-the-art baselines trained with RLHF, while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
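The idea of conditioning a response on an explicit set of attributes can be illustrated with a minimal sketch. The abstract does not specify the prompt template, the attribute names, or the score range SteerLM uses, so everything below (the `User:` / `Attributes:` / `Assistant:` layout, the helper names, and the 0 to 4 scores) is a hypothetical stand-in chosen for illustration, not the authors' format.

```python
from typing import Dict


def format_attributes(attributes: Dict[str, int]) -> str:
    """Serialize attribute scores as 'name:score' pairs in sorted order.

    Sorting keeps the string stable regardless of dict insertion order.
    The attribute names and score range are illustrative assumptions.
    """
    return ",".join(f"{name}:{score}" for name, score in sorted(attributes.items()))


def build_inference_prompt(prompt: str, attributes: Dict[str, int]) -> str:
    """Build the text the model would continue at inference time.

    The end user picks the attribute values here, which is what makes
    the output steerable. The template itself is hypothetical.
    """
    return (
        f"User: {prompt}\n"
        f"Attributes: {format_attributes(attributes)}\n"
        f"Assistant: "
    )


def build_training_example(prompt: str, response: str, attributes: Dict[str, int]) -> str:
    """Build one SFT example: the conditioned prompt followed by the response.

    During training the attribute values would come from annotations of the
    response, so the model learns to associate values with response style.
    """
    return build_inference_prompt(prompt, attributes) + response


if __name__ == "__main__":
    labels = {"helpfulness": 4, "humor": 0, "toxicity": 0}
    print(build_training_example("Explain RLHF briefly.", "RLHF is ...", labels))
    # At inference the same prompt can be steered toward a different style.
    print(build_inference_prompt("Explain RLHF briefly.", {**labels, "humor": 4}))
```

In this sketch, the same serialization is used for training and inference, so changing an attribute value at inference time is the only lever the user needs to pull.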

