Robot control using reinforcement learning has become popular, but in practice episodes are often terminated partway through for safety and to save time. This study addresses a problem in the most common exception handling that temporal-difference (TD) learning applies at such termination. Forcibly assuming zero value after termination implicitly and unintentionally underestimates or overestimates the value, depending on the reward design in the normal states. When an episode is terminated due to task failure, this unintentional overestimation may assign a high value to the failure, and the wrong policy may be acquired. Although the problem can be avoided by careful reward design, reviewing the exception handling at termination is essential for practical use of TD learning. This paper therefore proposes a method that intentionally underestimates the value after termination to avoid learning failures caused by unintentional overestimation. In addition, the degree of underestimation is adjusted according to the degree of stationarity at termination, which prevents the excessive exploration that intentional underestimation would otherwise induce. Simulations and real-robot experiments showed that the proposed method can stably obtain optimal policies for various tasks and reward designs. https://youtu.be/AxXr8uFOe7M
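As a rough illustration of the zero-value assumption discussed in the abstract, the following Python sketch shows the standard one-step TD target with the bootstrap term forced to zero at termination, and how the sign of the reward decides whether early termination looks better or worse than continuing. It is not the proposed method: the function names, the discount factor of 0.99, the constant per-step rewards, and the episode lengths are all assumptions chosen for illustration.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """Standard one-step TD target.

    The common exception handling replaces the value of the next state
    with zero whenever the episode has terminated.
    """
    bootstrap = 0.0 if terminated else next_value
    return reward + gamma * bootstrap


def discounted_return(reward_per_step, steps, gamma=0.99):
    """Discounted return of an episode that collects a constant reward
    for `steps` steps and is then cut off with zero value afterwards."""
    return sum(reward_per_step * gamma**t for t in range(steps))


# Hypothetical reward design 1: every step is penalized (reward = -1).
# An episode that fails after 5 steps is compared with one that runs 200 steps.
early_failure_neg = discounted_return(-1.0, steps=5)
full_episode_neg = discounted_return(-1.0, steps=200)

# Hypothetical reward design 2: every step is rewarded (reward = +1).
early_failure_pos = discounted_return(1.0, steps=5)
full_episode_pos = discounted_return(1.0, steps=200)

print("negative rewards: early failure looks better:",
      early_failure_neg > full_episode_neg)
print("positive rewards: early failure looks worse:",
      early_failure_pos < full_episode_pos)
```

Under the all-negative reward design, the episode that ends early collects less penalty, so zero value after termination implicitly overestimates the failure; under the all-positive design the relation reverses and termination is implicitly underestimated. This is the dependence on reward design that the abstract refers to.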