Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. However, manually designed reward functions for such tasks often fail to capture the qualities that matter: we find that hand-crafted rewards agree with human judgment only 60.7% of the time, underscoring the need for preference-driven approaches. In this work, we propose Reward Ensemble under Confidence (REC), a probabilistic reward learning framework for PbRL that explicitly models per-timestep reward uncertainty through an ensemble of distributional reward models. By propagating uncertainty into the preference loss and leveraging disagreement for exploration, REC achieves 88.4% of shaped reward performance on acrobatic quadrotor control, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them zero-shot to the real world, demonstrating complex acrobatic maneuvers learned purely from preference feedback. We further validate REC on a continuous control benchmark, confirming its applicability beyond the domain of aerial robotics.
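The abstract describes REC only at a high level; the sketch below is a minimal illustration, assuming a PyTorch implementation, of the two ingredients it names: an ensemble of distributional (mean and variance) per-timestep reward models, and a preference loss into which the per-step variances are propagated. The class and function names (`DistributionalRewardModel`, `preference_loss`, `exploration_bonus`) and the Thurstone-style probit form of the comparison are illustrative assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn

class DistributionalRewardModel(nn.Module):
    """One ensemble member: predicts a per-timestep reward mean and log-variance."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs [mean, log_var] per timestep
        )

    def forward(self, obs, act):
        out = self.net(torch.cat([obs, act], dim=-1))
        mean, log_var = out[..., 0], out[..., 1]
        return mean, log_var.clamp(-6.0, 2.0)


def preference_loss(ensemble, seg_a, seg_b, labels):
    """Uncertainty-aware preference loss over a pair of trajectory segments.

    seg_a / seg_b: dicts with 'obs' [B, T, obs_dim] and 'act' [B, T, act_dim].
    labels: float tensor [B], 1.0 if segment A is preferred, else 0.0.
    Per-step variances are summed over the segment and propagated into a
    probit (Thurstone-style) comparison of the two segment returns; this is
    one plausible way to realise "propagating uncertainty into the
    preference loss", not necessarily the one used by REC.
    """
    losses = []
    for model in ensemble:
        mu_a, lv_a = model(seg_a['obs'], seg_a['act'])
        mu_b, lv_b = model(seg_b['obs'], seg_b['act'])
        ret_a, var_a = mu_a.sum(-1), lv_a.exp().sum(-1)
        ret_b, var_b = mu_b.sum(-1), lv_b.exp().sum(-1)
        # P(A preferred over B) under independent Gaussian segment returns
        z = (ret_a - ret_b) / torch.sqrt(var_a + var_b + 1e-6)
        p_a = 0.5 * (1.0 + torch.erf(z / 2 ** 0.5))
        losses.append(nn.functional.binary_cross_entropy(
            p_a.clamp(1e-4, 1 - 1e-4), labels))
    return torch.stack(losses).mean()


def exploration_bonus(ensemble, obs, act):
    """Per-timestep disagreement across ensemble members.

    The standard deviation of the predicted mean rewards marks behaviour the
    reward models disagree on; a common choice is to add it as an intrinsic
    reward to steer the policy toward informative queries.
    """
    means = torch.stack([m(obs, act)[0] for m in ensemble])
    return means.std(dim=0)
```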