Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability: some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that the trust parameters naturally evolve during gradient-based optimization toward positive (trust), near-zero (ignore), or negative (flip) values, enabling the model to automatically invert adversarial preferences and recover useful signal rather than merely discarding corrupted feedback. We provide a theoretical analysis establishing identifiability guarantees and a detailed gradient analysis explaining how expert separation emerges during training without explicit supervision. Empirically, we evaluate TTP on four diverse domains spanning manipulation (MetaWorld) and locomotion (DM Control) tasks under various corruption scenarios. TTP achieves state-of-the-art robustness, maintaining near-oracle performance under adversarial corruption where standard PBRL methods fail catastrophically. Notably, TTP outperforms existing baselines by successfully learning from mixed expert pools containing both reliable and adversarial annotators, while requiring no expert features beyond identity indices and integrating seamlessly with existing PBRL pipelines.
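The following is a minimal sketch, not the paper's implementation, of the mechanism the abstract describes: a Bradley-Terry-style preference loss in which each annotator contributes a learnable scalar trust parameter that scales the reward difference, so gradient descent can push it positive (trust), toward zero (ignore), or negative (flip). All names (`TrustedPreferenceLoss`, `reward_net`, `beta`) are hypothetical illustrations and are assumed rather than taken from TTP.

```python
# Hedged sketch of trust-weighted preference learning; assumes a Bradley-Terry
# model where expert k's trust parameter beta_k scales the reward difference.
import torch
import torch.nn as nn


class TrustedPreferenceLoss(nn.Module):
    def __init__(self, num_experts: int):
        super().__init__()
        # One scalar trust parameter per annotator; start by trusting everyone
        # equally and let optimization separate the experts.
        self.beta = nn.Parameter(torch.ones(num_experts))

    def forward(self, r1: torch.Tensor, r2: torch.Tensor,
                label: torch.Tensor, expert_id: torch.Tensor) -> torch.Tensor:
        """r1, r2: predicted returns of the two segments in each comparison;
        label: 1 if the annotator preferred segment 1, else 0;
        expert_id: integer annotator index for each comparison."""
        logits = self.beta[expert_id] * (r1 - r2)  # trust-scaled Bradley-Terry logit
        return nn.functional.binary_cross_entropy_with_logits(logits, label.float())


if __name__ == "__main__":
    # Jointly optimize the shared reward model and the per-expert trust parameters.
    reward_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
    loss_fn = TrustedPreferenceLoss(num_experts=5)
    optim = torch.optim.Adam(
        list(reward_net.parameters()) + list(loss_fn.parameters()), lr=3e-4)

    segs1 = torch.randn(32, 10, 8)   # 32 comparisons, 10 steps, 8-dim states
    segs2 = torch.randn(32, 10, 8)
    labels = torch.randint(0, 2, (32,))
    expert_ids = torch.randint(0, 5, (32,))

    r1 = reward_net(segs1).sum(dim=(1, 2))  # per-segment return estimates
    r2 = reward_net(segs2).sum(dim=(1, 2))
    loss = loss_fn(r1, r2, labels, expert_ids)
    loss.backward()
    optim.step()
```

Under this reading, a consistently adversarial annotator drives its beta negative, which is equivalent to flipping that annotator's labels, rather than simply down-weighting them.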