In reinforcement learning (RL), different reward functions can define the same optimal policy yet lead to drastically different learning performance. With some, the agent gets stuck in suboptimal behavior; with others, it solves the task efficiently. Choosing a good reward function is hence an extremely important yet challenging problem. In this paper, we explore an alternative way of using rewards for learning. We introduce max-reward RL, where an agent optimizes the maximum rather than the cumulative reward. Unlike earlier works, our approach works for both deterministic and stochastic environments and can be easily combined with state-of-the-art RL algorithms. In the experiments, we study the performance of max-reward RL algorithms in two goal-reaching environments from Gymnasium-Robotics and demonstrate their benefits over standard RL. The code is publicly available.
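To make the contrast between the two objectives concrete, the following is a minimal sketch rather than the paper's implementation. It scores a single reward trajectory under the standard cumulative objective and under a max-reward objective. The function names, the discount factor `gamma`, and the choice to take the maximum over discounted rewards are assumptions made for illustration; the abstract does not specify the exact formulation.

```python
from typing import Sequence


def cumulative_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Standard RL objective: discounted sum of rewards along one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def max_reward_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Max-reward objective (assumed form): largest discounted reward seen.

    Raises ValueError on an empty trajectory, since the maximum is undefined.
    """
    return max(gamma**t * r for t, r in enumerate(rewards))


# Two hypothetical trajectories: one collects small rewards at every step,
# the other gets a single large reward at the end (e.g. reaching a goal).
steady = [0.2, 0.2, 0.2, 0.2, 0.2]
goal_reaching = [0.0, 0.0, 0.0, 0.0, 0.9]

for name, traj in [("steady", steady), ("goal_reaching", goal_reaching)]:
    print(
        name,
        "cumulative:", round(cumulative_return(traj, gamma=1.0), 3),
        "max:", round(max_reward_return(traj, gamma=1.0), 3),
    )
```

With `gamma=1.0`, the cumulative objective ranks the steady trajectory higher, while the max-reward objective ranks the goal-reaching trajectory higher. This shows how the two objectives can rank the same behaviors differently.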