We consider the problem of multi-objective alignment of foundation models with human preferences, a critical step towards helpful and harmless AI systems. However, fine-tuning large foundation models with reinforcement learning (RL) is generally costly and unstable, and the multi-dimensional, heterogeneous, and conflicting nature of human preferences further complicates the alignment process. In this paper, we introduce Rewards-in-Context (RiC), which conditions the response of a foundation model on multiple rewards in its prompt context and applies supervised fine-tuning for alignment. The salient features of RiC are simplicity and adaptivity: it requires only supervised fine-tuning of a single foundation model and supports dynamic adjustment to user preferences at inference time. Inspired by the analytical solution of an abstracted convex optimization problem, our dynamic inference-time adjustment method approaches the Pareto-optimal solution for multiple objectives. Empirical evidence demonstrates the efficacy of our method in aligning both Large Language Models (LLMs) and diffusion models to accommodate diverse rewards with only around $10\%$ of the GPU hours required by a multi-objective RL baseline.
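To make the idea of conditioning on rewards in the prompt context concrete, the following is a minimal sketch, not the implementation used in this work. The helper name `build_ric_prompt`, the tag format `<name> score`, and the reward names `helpful` and `harmless` are all hypothetical and chosen for illustration only. The same formatting would serve two purposes: at training time the measured rewards of a response are written into the prompt before supervised fine-tuning, and at inference time the user's desired rewards are written in instead.

```python
from typing import Mapping, Optional


def build_ric_prompt(
    prompt: str,
    rewards: Mapping[str, float],
    response: Optional[str] = None,
) -> str:
    """Append one reward tag per objective to the prompt (hypothetical format).

    If `response` is given, the result is a full training string for
    supervised fine-tuning; otherwise it is an inference-time prompt.
    """
    # Assumed tag format: "<objective_name> score", scores rounded to 1 decimal.
    tags = " ".join(f"<{name}> {score:.1f}" for name, score in rewards.items())
    text = f"{prompt} {tags}"
    if response is not None:
        text = f"{text} {response}"
    return text


# Training: rewards are the scores assigned to an existing response.
train_text = build_ric_prompt(
    "How do I reset my password?",
    {"helpful": 0.8, "harmless": 1.2},
    response="Open Settings and choose Reset password.",
)

# Inference: rewards are targets chosen to reflect the user's preference.
infer_text = build_ric_prompt(
    "How do I reset my password?",
    {"helpful": 1.5, "harmless": 1.5},
)
```

How the inference-time target rewards are derived from a user's preference weights is not specified in the text above, so the sketch leaves them as plain inputs.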