Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g., some RMs may excel at scoring creative writing but not math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be computationally prohibitive, and conflicting signals from different RMs can degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% compared to training with ensemble RM scores, while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts, we find that Llama-3-8B trained with LASeR achieves a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves average improvements of 2.64 F1 and 2.42 F1 on single- and multi-document QA, respectively, over random RM selection when used with best-of-n sampling. LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, LASeR's RM selection adapts to the underlying task or instance, and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.