Many real-world robot learning problems, such as pick-and-place or arriving at a destination, can be seen as problems of reaching a goal state as soon as possible. When formulated as episodic reinforcement learning tasks, these problems can easily be specified to align with the intended goal: a reward of -1 at every time step, with termination upon reaching the goal state. We call these minimum-time tasks. Despite this simplicity, such formulations are often overlooked in favor of dense rewards because they are perceived as difficult and uninformative. Our studies contrast the two reward paradigms, revealing that the minimum-time task specification not only facilitates learning higher-quality policies but can also surpass dense-reward-based policies on their own performance metrics. Crucially, we also identify the goal-hit rate of the initial policy as a robust early indicator of learning success in such sparse-feedback settings. Finally, using four distinct real-robot platforms, we show that it is possible to learn pixel-based policies from scratch within two to three hours using constant negative rewards.
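To make the minimum-time specification concrete, the following is a minimal sketch on a toy problem, not the setup used in the experiments. The environment `ChainReachEnv`, the helpers `run_episode` and `goal_hit_rate`, and the time limit `max_steps` are all hypothetical names introduced for illustration. The hit rate here is assumed to be the fraction of episodes that reach the goal before the time limit; the exact definition used in the study may differ.

```python
import random
from typing import Callable, Tuple


class ChainReachEnv:
    """Hypothetical 1-D reaching task: start at position 0, goal at `goal`."""

    def __init__(self, goal: int = 5, max_steps: int = 50) -> None:
        self.goal = goal
        self.max_steps = max_steps  # assumed time limit, used only for truncation
        self.pos = 0
        self.t = 0

    def reset(self) -> int:
        self.pos = 0
        self.t = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool, bool]:
        """Apply an action in {-1, +1}; return (obs, reward, terminated, truncated)."""
        self.pos += action
        self.t += 1
        terminated = self.pos == self.goal  # episode ends on reaching the goal
        truncated = (not terminated) and self.t >= self.max_steps
        reward = -1.0  # constant reward of -1 at every time step
        return self.pos, reward, terminated, truncated


def run_episode(env: ChainReachEnv, policy: Callable[[int], int]) -> Tuple[float, bool]:
    """Run one episode; return (undiscounted return, whether the goal was hit)."""
    obs = env.reset()
    episode_return = 0.0
    while True:
        obs, reward, terminated, truncated = env.step(policy(obs))
        episode_return += reward
        if terminated or truncated:
            return episode_return, terminated


def goal_hit_rate(env: ChainReachEnv, policy: Callable[[int], int], episodes: int = 200) -> float:
    """Fraction of episodes in which the policy reaches the goal before the time limit."""
    hits = sum(run_episode(env, policy)[1] for _ in range(episodes))
    return hits / episodes


if __name__ == "__main__":
    random.seed(0)
    env = ChainReachEnv()
    random_policy = lambda obs: random.choice((-1, 1))  # stand-in for an untrained policy
    print("initial goal-hit rate:", goal_hit_rate(env, random_policy))
```

Because every step costs -1, the undiscounted return of an episode that reaches the goal equals the negative of the number of steps taken, so maximizing return is the same as minimizing time to goal. Estimating `goal_hit_rate` for the untrained policy before training is one way to check the early indicator described above.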