Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
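As an illustration of the pseudo-labeling idea described above, the following is a minimal sketch and not the VOTP implementation. It assumes each trajectory segment has already been encoded into per-frame embeddings by some video model (random arrays stand in for those embeddings here), uses entropy-regularized optimal transport (Sinkhorn iterations) with a cosine cost to compare two segments, and assigns a preference to an unlabeled pair according to which segment is closer to a set of segments a human has labeled as preferred. The function names `sinkhorn_plan`, `ot_distance`, and `pseudo_label`, the cosine cost, the nearest-reference rule, and the values of `reg` and `n_iters` are all assumptions made for this example.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions.

    cost: (n, m) pairwise cost matrix between frames of two segments.
    reg and n_iters are illustrative defaults, not values from the paper.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass over frames of segment x
    b = np.full(m, 1.0 / m)          # uniform mass over frames of segment y
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def ot_distance(x, y, reg=0.1):
    """Transport cost between two segments of frame embeddings.

    x: (n, d) and y: (m, d) arrays; the cost is 1 - cosine similarity
    (an assumed choice for this sketch).
    """
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T
    plan = sinkhorn_plan(cost, reg)
    return float(np.sum(plan * cost))


def pseudo_label(seg_a, seg_b, preferred_refs):
    """Hypothetical labeling rule: return 0 if seg_a is closer (in OT cost)
    to any human-labeled preferred segment than seg_b is, otherwise 1."""
    da = min(ot_distance(seg_a, r) for r in preferred_refs)
    db = min(ot_distance(seg_b, r) for r in preferred_refs)
    return 0 if da <= db else 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(8, 16))                      # stand-in for a labeled preferred segment
    seg_a = ref + 0.01 * rng.normal(size=ref.shape)     # near-copy of the reference
    seg_b = rng.normal(size=(8, 16))                    # unrelated segment
    print("pseudo-label:", pseudo_label(seg_a, seg_b, [ref]))
```

In a semi-supervised setting, labels produced this way could be added to the small human-labeled set before fitting a reward model. How VOTP itself filters or weights pseudo-labels is not stated in the abstract, so that step is left out of the sketch.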