Preference-based reinforcement learning offers a scalable alternative to manual reward engineering by learning reward structures from comparative feedback. However, large-scale preference datasets, whether collected from crowdsourced annotators or generated by synthetic teachers, often contain heterogeneous and partially conflicting supervision, including disagreement across annotators and inconsistency within individual annotators. Existing reward learning methods typically fit a single reward model to such data, forcing it to average incompatible signals and thereby limiting robustness. To address this limitation, we propose PrefMoE, a mixture-of-experts reward learning framework for robust preference modeling. PrefMoE learns multiple specialized reward experts and uses trajectory-level soft routing to combine them adaptively, enabling the model to capture diverse latent preference patterns under noisy and heterogeneous preference supervision. A load-balancing regularizer further stabilizes training by preventing expert collapse. Across locomotion benchmarks from D4RL and manipulation tasks from MetaWorld, PrefMoE improves preference prediction robustness and leads to more reliable downstream policy learning than strong single-model baselines.
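To make the ingredients named above concrete, the following is a minimal sketch of a mixture-of-experts preference model with trajectory-level soft routing and a load-balancing term. It is not the PrefMoE implementation: the abstract does not specify the architecture, so the linear experts, the gate computed from mean-pooled trajectory features, the Bradley-Terry preference likelihood, the squared-usage balance penalty, and all function and parameter names (`route`, `moe_return`, `balance_coef`, and so on) are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def route(gate_weights, traj):
    """Trajectory-level soft routing: one gate vector per trajectory.

    Assumption: the gate sees the mean of the per-step feature vectors.
    """
    dim = len(traj[0])
    summary = [sum(step[d] for step in traj) / len(traj) for d in range(dim)]
    return softmax([dot(g, summary) for g in gate_weights])


def moe_return(expert_weights, gate_weights, traj):
    """Gate-weighted combination of each (linear) expert's summed reward."""
    gates = route(gate_weights, traj)
    expert_returns = [sum(dot(w, step) for step in traj) for w in expert_weights]
    return sum(g * r for g, r in zip(gates, expert_returns)), gates


def preference_prob(expert_weights, gate_weights, traj_a, traj_b):
    """Bradley-Terry probability that traj_a is preferred over traj_b."""
    ret_a, _ = moe_return(expert_weights, gate_weights, traj_a)
    ret_b, _ = moe_return(expert_weights, gate_weights, traj_b)
    return sigmoid(ret_a - ret_b)


def load_balance_penalty(gates_batch):
    """Assumed regularizer: K * sum_k(mean usage_k ** 2) - 1.

    It is 0 when experts are used uniformly on average and K - 1 when
    all routing mass collapses onto a single expert.
    """
    k = len(gates_batch[0])
    usage = [sum(g[i] for g in gates_batch) / len(gates_batch) for i in range(k)]
    return k * sum(u * u for u in usage) - 1.0


def preference_loss(expert_weights, gate_weights, pairs, labels, balance_coef=0.01):
    """Mean cross-entropy on preference labels plus the balance penalty.

    labels[i] is 1 if the first trajectory of pairs[i] is preferred, else 0.
    """
    nll, gates_batch = 0.0, []
    for (traj_a, traj_b), y in zip(pairs, labels):
        p = preference_prob(expert_weights, gate_weights, traj_a, traj_b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        gates_batch.append(route(gate_weights, traj_a))
        gates_batch.append(route(gate_weights, traj_b))
    return nll / len(pairs) + balance_coef * load_balance_penalty(gates_batch)
```

Here a trajectory is a list of per-step feature vectors, and `expert_weights` and `gate_weights` each hold one weight vector per expert. In practice these parameters would be neural networks trained by gradient descent in an autodiff framework; plain lists are used only to keep the sketch dependency-free.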