As large language models (LLMs) become increasingly embedded in everyday applications, ensuring their alignment with the diverse preferences of individual users has become a critical challenge. Currently deployed approaches typically assume homogeneous user objectives and rely on single-objective fine-tuning. However, human preferences are inherently heterogeneous, influenced by various unobservable factors, leading to conflicting signals in preference data. Existing solutions addressing this diversity often require costly datasets labelled for specific objectives and involve training multiple reward models or LLM policies, which is computationally expensive and impractical. In this work, we present a novel framework for few-shot steerable alignment, where users' underlying preferences are inferred from a small sample of their choices. To achieve this, we extend the Bradley-Terry-Luce model to handle heterogeneous preferences with unobserved variability factors and propose its practical implementation for reward modelling and LLM fine-tuning. Thanks to our proposed approach of functional parameter-space conditioning, LLMs trained with our framework can be adapted to individual preferences at inference time, generating outputs over a continuum of behavioural modes. We empirically validate the effectiveness of our methods, demonstrating their ability to capture and align with diverse human preferences in a data-efficient manner. Our code is made available at: https://github.com/kasia-kobalczyk/few-shot-steerable-alignment.
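As background, a minimal sketch of the standard Bradley-Terry preference probability that the abstract refers to; the latent-conditioned form is only an illustrative reading of the "unobserved variability factors" mentioned above, not the paper's exact formulation:

P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)

and, conditioning on a hypothetical unobserved preference factor z,

P(y_1 \succ y_2 \mid x, z) = \sigma\big(r(x, y_1, z) - r(x, y_2, z)\big),

where x is the prompt, y_1 and y_2 are candidate responses, r is a learned reward function, and \sigma is the logistic function.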