Model alignment with human preferences is an essential step in making Large Language Models (LLMs) helpful and consistent with human values. It typically consists of supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) stages. However, RLHF faces inherent limitations stemming from a complex training setup and its tendency to align the model with implicit values that end users cannot control at run-time. Moreover, reward models in the RLHF stage commonly rely on single-dimensional feedback as opposed to explicit, multifaceted signals that indicate attributes such as helpfulness, humor, and toxicity. To address these limitations, we propose SteerLM, a supervised fine-tuning method that empowers end users to control responses during inference. SteerLM conditions responses to conform to an explicitly defined multi-dimensional set of attributes, thereby enabling a steerable AI capable of generating helpful and high-quality responses while maintaining customizability. Experiments show that SteerLM trained on open-source datasets generates responses that human and automatic evaluators prefer over those of many state-of-the-art baselines trained with RLHF, while being much easier to train. Try SteerLM at https://huggingface.co/nvidia/SteerLM-llama2-13B
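
To make the idea of attribute-conditioned generation concrete, the following is a minimal sketch of how a prompt might carry explicit attribute values that the user sets at inference time. The attribute names come from the examples above, but the 0 to 9 scale, the `User:`/`Attributes:`/`Assistant:` template, and the helper names `format_attributes` and `build_conditioned_prompt` are assumptions made for illustration; they are not the format used by the released model.

```python
from typing import Dict

# Hypothetical attribute set and value range, chosen for illustration only.
ALLOWED_ATTRIBUTES = ("helpfulness", "humor", "toxicity")
MIN_VALUE, MAX_VALUE = 0, 9


def format_attributes(attributes: Dict[str, int]) -> str:
    """Serialize attribute values as 'name:value' pairs in a fixed order."""
    for name, value in attributes.items():
        if name not in ALLOWED_ATTRIBUTES:
            raise ValueError(f"unknown attribute: {name}")
        if not MIN_VALUE <= value <= MAX_VALUE:
            raise ValueError(f"{name} must be in [{MIN_VALUE}, {MAX_VALUE}]")
    # A fixed ordering keeps the conditioning string identical across calls.
    return ",".join(
        f"{name}:{attributes[name]}"
        for name in ALLOWED_ATTRIBUTES
        if name in attributes
    )


def build_conditioned_prompt(user_prompt: str, attributes: Dict[str, int]) -> str:
    """Attach the requested attribute values to the prompt (assumed template)."""
    return (
        f"User: {user_prompt}\n"
        f"Attributes: {format_attributes(attributes)}\n"
        f"Assistant:"
    )


if __name__ == "__main__":
    print(build_conditioned_prompt(
        "Explain RLHF in one sentence.",
        {"helpfulness": 9, "humor": 2, "toxicity": 0},
    ))
```

Under this sketch, changing only the attribute dictionary (for example, raising `humor`) changes the conditioning string while the user prompt stays fixed, which is the kind of run-time control the method aims to provide.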