The main challenge in developing effective reinforcement learning (RL) pipelines is often the design and tuning of reward functions. A well-designed shaping reward can lead to significantly faster learning. Naively formulated rewards, however, can conflict with the desired behavior and, if not properly tuned, result in overfitting or even erratic performance. In theory, the broad class of potential-based reward shaping (PBRS) can help guide the learning process without affecting the optimal policy. Although several studies have explored the use of PBRS to accelerate learning convergence, most have been limited to grid-worlds and low-dimensional systems, and RL in robotics has predominantly relied on standard forms of reward shaping. In this paper, we benchmark standard forms of shaping against PBRS for a humanoid robot. We find that in this high-dimensional system, PBRS has only marginal benefits in convergence speed. However, the PBRS reward terms are significantly more robust to scaling than typical reward shaping approaches, and are thus easier to tune.
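To make the policy-invariance claim concrete, the following is a minimal sketch of the standard PBRS term, F(s, s') = γΦ(s') − Φ(s), and not the reward terms used in this work. The potential `height_potential`, the scalar state, and the sample trajectory are hypothetical and chosen only for illustration. The sketch shows that the discounted shaping terms telescope along a trajectory, so their sum depends only on the first and last states, which is why the shaping cannot change which policy is optimal.

```python
from typing import Callable, Sequence


def pbrs_term(phi: Callable[[float], float], s: float, s_next: float, gamma: float) -> float:
    """Standard potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def discounted_shaping_sum(phi: Callable[[float], float], states: Sequence[float], gamma: float) -> float:
    """Discounted sum of the shaping terms along a trajectory of states."""
    return sum(
        (gamma ** t) * pbrs_term(phi, states[t], states[t + 1], gamma)
        for t in range(len(states) - 1)
    )


def height_potential(s: float, target: float = 1.0) -> float:
    """Hypothetical potential (an assumption): negative distance to a target height."""
    return -abs(target - s)


# Hypothetical trajectory of a scalar state, e.g. a base height in meters.
states = [0.2, 0.5, 0.9, 1.0]
gamma = 0.99

total = discounted_shaping_sum(height_potential, states, gamma)

# Telescoped form: gamma^T * phi(s_T) - phi(s_0), independent of intermediate states.
horizon = len(states) - 1
telescoped = (gamma ** horizon) * height_potential(states[-1]) - height_potential(states[0])
```

Because the sum collapses to a function of the endpoints, rescaling Φ rescales only this boundary term. That is consistent with, though not a proof of, the robustness to scaling reported above.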