As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
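As background, a minimal sketch of the standard Bradley-Terry preference probability that the abstract refers to; the latent-conditioned form is only an illustrative reading of the "unobserved variability factors" mentioned above, not the paper's exact formulation:

P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)

and, conditioning on a hypothetical unobserved preference factor z,

P(y_1 \succ y_2 \mid x, z) = \sigma\big(r(x, y_1, z) - r(x, y_2, z)\big),

where x is the prompt, y_1 and y_2 are candidate responses, r is a learned reward function, and \sigma is the logistic function.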