LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

Reward models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized for one task (e.g., writing) generalizes to new tasks (e.g., math) is often not known a priori, so training LLMs with a single fixed RM is frequently suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and produce conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem and trains LLMs efficiently and iteratively with multiple RMs by selecting the best-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also being more efficient (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR achieves a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves over the RM score ensemble baseline with best-of-n sampling by 2.96 F1 points (avg.) on single-document QA tasks and by 2.97 F1 points on few-shot learning.
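To make the bandit framing concrete, here is a minimal sketch in which each reward model is one arm and a selector picks which RM to use next. The abstract does not say which bandit algorithm LASeR uses, so this sketch assumes plain UCB1, which is not contextual. Selecting an RM per instance, as the abstract describes, would call for a contextual variant. The RM names, the `usefulness` values, and the idea of feeding back a scalar in [0, 1] after each training step are all hypothetical placeholders, not details from the paper.

```python
import math


class UCB1Selector:
    """Non-contextual UCB1 bandit over a fixed set of reward models (arms)."""

    def __init__(self, rm_names, exploration=2.0):
        self.rm_names = list(rm_names)
        self.exploration = exploration
        self.counts = [0] * len(self.rm_names)   # pulls per arm
        self.means = [0.0] * len(self.rm_names)  # running mean feedback per arm

    def select(self):
        # Pull every arm once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Mean feedback plus an exploration bonus that shrinks with more pulls.
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (feedback - self.means[arm]) / self.counts[arm]


def simulate(rounds=200):
    """Toy loop: feedback is a fixed, made-up usefulness score per RM."""
    selector = UCB1Selector(["rm_writing", "rm_math", "rm_chat"])
    usefulness = [0.2, 0.8, 0.5]  # hypothetical; a real signal would come from training
    for _ in range(rounds):
        arm = selector.select()
        # In an actual training loop, the chosen RM would score sampled
        # responses here, and the feedback would reflect how training went.
        selector.update(arm, usefulness[arm])
    return selector
```

Calling `simulate()` returns a selector whose `counts` show the arm with the highest assumed usefulness being pulled most often, while the other arms still receive occasional exploratory pulls. Only one RM is queried per round, which is where the efficiency argument over scoring with every RM comes from.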

