Recent advances in large language models have relied heavily on large reward models trained via reinforcement learning from human feedback (RLHF) for fine-tuning. However, a single reward model applied across diverse domains is not always optimal, and it often must be retrained from scratch when data from a new domain is introduced. To address these challenges, we explore small language models that operate in a domain-specific manner based on router mechanisms. We propose three approaches: 1) using a mixture of experts to form a single reward model by modularizing an internal router and experts, 2) employing an external router to select the appropriate reward model from multiple domain-specific models, and 3) loading reward models and router adapters onto a single small language model via adapters, thereby reducing the total parameter size. Experimental results underscore the effectiveness of our approach, demonstrating performance comparable to baseline methods while reducing the total parameter size.
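To make the routing idea concrete, below is a minimal PyTorch sketch of the second approach, in which an external router selects among domain-specific reward models. All names and dimensions are illustrative assumptions, and the linear layers stand in for the router and the small-language-model reward models; this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class ExternalRouterRM(nn.Module):
    """Route each input to one of several domain-specific reward models.

    Hypothetical sketch: the router is a single linear domain classifier,
    and each reward model is a linear head standing in for a full
    domain-specific small-language-model reward model.
    """

    def __init__(self, hidden_dim: int, num_domains: int):
        super().__init__()
        # External router: predicts which domain an input belongs to.
        self.router = nn.Linear(hidden_dim, num_domains)
        # One reward head per domain (placeholders for domain-specific RMs).
        self.reward_models = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_domains)
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim) pooled embedding of a prompt-response pair
        domain = self.router(pooled).argmax(dim=-1)  # hard routing: (batch,)
        # Score with every domain RM, then keep only the routed domain's score.
        scores = torch.stack(
            [rm(pooled).squeeze(-1) for rm in self.reward_models], dim=1
        )  # (batch, num_domains)
        return scores.gather(1, domain.unsqueeze(1)).squeeze(1)  # (batch,)

# Usage: score a batch of 4 pooled embeddings across 3 assumed domains.
model = ExternalRouterRM(hidden_dim=768, num_domains=3)
rewards = model(torch.randn(4, 768))  # tensor of shape (4,)
```

The sketch uses hard (argmax) routing for clarity; a soft variant would instead weight each domain's reward score by the router's softmax probabilities, which is closer in spirit to the mixture-of-experts formulation of the first approach.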