Reinforcement learning is commonly concerned with maximizing accumulated rewards in Markov decision processes. Often, a certain goal state or a subset of the state space attains maximal reward, in which case the environment may be considered solved once the goal is reached. Whereas numerous techniques, learning-based or not, exist for solving environments, doing so optimally remains the greatest challenge. For instance, one may choose a reward rate that penalizes the action effort. Reinforcement learning is currently among the most actively developed frameworks for solving environments optimally by maximizing accumulated reward, in other words, the return. Yet tuning agents is a notoriously hard task, as reported in a series of works. Our aim here is to help the agent learn a near-optimal policy efficiently while ensuring the goal-reaching property of some basis policy that merely solves the environment. We suggest an algorithm that is fairly flexible and can be used to augment practically any agent as long as it comprises a critic. A formal proof of the goal-reaching property is provided. Simulation experiments on six problems under five agents, including the benchmarked one, provide empirical evidence that learning can indeed be boosted while the goal-reaching property is ensured.
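To fix notation, the objective referred to above is the standard discounted return; the following is a minimal sketch in which the symbols ($\pi$, $\gamma$, $r$, $s_t$, $a_t$, $R$, $c$, $\mathcal{G}$) are illustrative assumptions rather than the paper's specific choices:

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad \gamma \in (0, 1],
\]

where, as one example of a reward rate that penalizes action effort, one may take

\[
r(s, a) \;=\; -\,a^{\top} R\, a \;+\; c\,\mathbb{1}\{s \in \mathcal{G}\}, \qquad R \succ 0,\ c > 0,
\]

with $\mathcal{G}$ denoting the goal set and $c$ a bonus for reaching it; both the quadratic penalty and the indicator bonus are assumed forms for illustration only.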