Reward Models (RMs) are crucial for aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g., writing) generalizes to new tasks (e.g., math) is often not known a priori, making the use of a single fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost, and conflicting signals from different RMs may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs with multiple RMs by selecting the RM best suited to each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores, while also being more efficient (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR achieves a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves over the RM score ensemble baseline with best-of-n sampling by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning.
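The abstract frames RM selection as a multi-armed bandit problem but does not specify the bandit algorithm. The following is a minimal sketch, assuming a UCB1-style selector over RMs; the class name, the per-step feedback signal, and the usage loop are illustrative assumptions, not the paper's implementation.

```python
import math
import random


class UCB1RMSelector:
    """Minimal UCB1 bandit over a set of reward models (RMs).

    Each RM is an arm: for each training instance we pick the RM with the
    highest upper-confidence bound, would use its scores to rank candidate
    responses, and update the arm with whatever downstream feedback is
    observed (e.g., a validation gain). Illustrative only.
    """

    def __init__(self, num_rms):
        self.counts = [0] * num_rms    # times each RM has been selected
        self.values = [0.0] * num_rms  # running mean feedback per RM
        self.total = 0                 # total selections so far

    def select(self):
        # Play each arm once before applying the UCB rule.
        for rm in range(len(self.counts)):
            if self.counts[rm] == 0:
                return rm
        ucb = [
            self.values[rm]
            + math.sqrt(2 * math.log(self.total) / self.counts[rm])
            for rm in range(len(self.counts))
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, rm, feedback):
        # Incremental mean update for the selected arm.
        self.counts[rm] += 1
        self.total += 1
        self.values[rm] += (feedback - self.values[rm]) / self.counts[rm]


# Hypothetical usage: pick an RM per training instance, then feed back a
# scalar signal (a random stand-in here) to the bandit.
if __name__ == "__main__":
    selector = UCB1RMSelector(num_rms=3)
    for step in range(10):
        rm = selector.select()
        feedback = random.random()  # stand-in for an observed training signal
        selector.update(rm, feedback)
    print(selector.counts, [round(v, 2) for v in selector.values])
```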