Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions, whose construction is time-consuming and error-prone. It also requires domain knowledge, which runs contrary to the goal of autonomous learning. We introduce the Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward-shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine the states at which to add shaping rewards for more efficient learning, while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task, thus ensuring efficient convergence to high-performance policies. We demonstrate ROSA's properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse-reward environments.
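The switching-control idea can be illustrated with a minimal Python sketch. This is not the ROSA implementation: the names `SwitchingShaper`, `switch`, `bonus`, and `shaped_reward` are hypothetical, and both the switching rule and the shaping reward are fixed, hand-written functions here, whereas in ROSA the Shaper learns them while the Controller learns the task policy. The sketch only shows how a per-state switch gates whether a shaping reward is added to the environment reward seen by the Controller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SwitchingShaper:
    """Toy stand-in for the Shaper (hypothetical; no learning is performed)."""

    # Assumption: returns True in states where shaping is switched on.
    switch: Callable[[int], bool]
    # Assumption: size of the shaping reward added in a switched-on state.
    bonus: Callable[[int], float]

    def shaped_reward(self, state: int, env_reward: float) -> float:
        """Reward the Controller would train on in `state`."""
        if self.switch(state):
            return env_reward + self.bonus(state)
        # Switch is off: the environment reward passes through unchanged.
        return env_reward


# Illustrative 5-state chain with a sparse (all-zero) environment reward.
# Shaping is switched on only in states 3 and 4.
shaper = SwitchingShaper(switch=lambda s: s >= 3, bonus=lambda s: 0.5)
rewards = [shaper.shaped_reward(s, 0.0) for s in range(5)]
```

Here `rewards` is zero in the states where the switch is off and carries the bonus elsewhere, so shaping is applied selectively rather than in every state. A learned Shaper would replace the two lambdas with trained components.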