Learning from examples of success is an appealing approach to reinforcement learning that eliminates many of the disadvantages of using hand-crafted reward functions or full expert-demonstration trajectories, both of which can be difficult to acquire, biased, or suboptimal. However, learning from examples alone dramatically increases the exploration challenge, especially for complex tasks. This work introduces value-penalized auxiliary control from examples (VPACE); we significantly improve exploration in example-based control by adding scheduled auxiliary control and examples of auxiliary tasks. Furthermore, we identify a value-calibration problem, in which policy value estimates can exceed their theoretical limits based on successful data. We resolve this problem, which is exacerbated by learning auxiliary tasks, by adding an above-success-level value penalty. Across three simulated environments, one real robotic manipulation environment, and 21 different main tasks, we show that our approach substantially improves learning efficiency. Videos, code, and datasets are available at https://papers.starslab.ca/vpace.