Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data. Reinforcement learning, notably from human feedback (RLHF), can further align the network with the intended usage. Yet imperfections in the proxy reward may hinder training and lead to suboptimal results; the diversity of objectives in real-world tasks and human opinions exacerbates the issue. This paper proposes embracing the heterogeneity of diverse rewards by following a multi-policy strategy. Rather than focusing on a single a priori reward, we aim for Pareto-optimal generalization across the entire space of preferences. To this end, we propose rewarded soup, first specializing multiple networks independently (one for each proxy reward) and then interpolating their weights linearly. This succeeds empirically because, as we show, the weights remain linearly connected when fine-tuned on diverse rewards from a shared pre-trained initialization. We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks. We hope to enhance the alignment of deep models and how they interact with the world in all its diversity.
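The weight interpolation step of rewarded soup can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rewarded_soup`, the `Weights` type (a plain dictionary standing in for a model state dict), and the toy parameter values are all hypothetical. The sketch also assumes that the interpolating coefficients are non-negative and sum to 1, which the abstract does not state explicitly.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def rewarded_soup(finetuned: Sequence[Weights], lambdas: Sequence[float]) -> Weights:
    """Linearly interpolate the weights of networks fine-tuned on different rewards.

    Assumes all networks share the same architecture (same parameter names and
    shapes), as they would when fine-tuned from one pre-trained initialization.
    """
    if len(finetuned) != len(lambdas):
        raise ValueError("need exactly one coefficient per fine-tuned network")
    # Assumption of this sketch: coefficients form a convex combination.
    if any(lam < 0 for lam in lambdas) or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("coefficients must be non-negative and sum to 1")

    soup: Weights = {}
    for name in finetuned[0]:
        # Group the i-th entry of this parameter across all networks.
        per_entry = zip(*(weights[name] for weights in finetuned))
        soup[name] = [
            sum(lam * value for lam, value in zip(lambdas, values))
            for values in per_entry
        ]
    return soup


# Toy example: two networks, each specialized on a different proxy reward.
theta_1 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta_2 = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}

# An equal-preference soup; other coefficient choices express other trade-offs.
soup = rewarded_soup([theta_1, theta_2], [0.5, 0.5])
print(soup)  # {'layer.weight': [2.0, 1.0], 'layer.bias': [0.5]}
```

Sweeping the coefficients from `[1.0, 0.0]` to `[0.0, 1.0]` yields a family of interpolated networks without further training; under the linear connectivity described above, these are the candidates for covering the space of preferences between the two rewards.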