In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories under a given policy, is uninformative about the average along a single but infinitely long trajectory. Thus, if we care about how an individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to the more widely used notion of ergodic Markov chains, and present existing solutions that optimize the long-term performance of individual trajectories under non-ergodic reward dynamics.
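To make the gap between the two averages concrete, the following minimal sketch uses a multiplicative coin-toss gamble, a standard illustration of non-ergodicity. It is not necessarily the example used in the paper; the multipliers 1.5 and 0.6, the probability 0.5, and all function names are assumptions chosen for illustration. Wealth is multiplied by 1.5 on heads and by 0.6 on tails, so the expected value grows by a factor of 1.05 per step, while a single long trajectory shrinks by a factor of about 0.95 per step.

```python
import math
import random

# Hypothetical gamble parameters (assumptions for illustration only).
UP, DOWN, P_UP = 1.5, 0.6, 0.5


def ensemble_growth_factor(up=UP, down=DOWN, p_up=P_UP):
    """Per-step growth factor of expected wealth (average over many trajectories)."""
    return p_up * up + (1 - p_up) * down


def time_average_growth_factor(up=UP, down=DOWN, p_up=P_UP):
    """Per-step growth factor along one infinitely long trajectory.

    This is the geometric mean of the multipliers, i.e. exp(E[log multiplier]).
    """
    return math.exp(p_up * math.log(up) + (1 - p_up) * math.log(down))


def simulate_log_wealth(steps, rng, up=UP, down=DOWN, p_up=P_UP):
    """Log-wealth of a single trajectory after `steps` tosses, starting from wealth 1."""
    log_wealth = 0.0
    for _ in range(steps):
        log_wealth += math.log(up if rng.random() < p_up else down)
    return log_wealth


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    print("ensemble growth factor:", ensemble_growth_factor())
    print("time-average growth factor:", time_average_growth_factor())
    print("log-wealth after 10000 steps:", simulate_log_wealth(10_000, rng))
```

Under these assumptions, an agent that maximizes expected wealth would accept the gamble at every step, even though almost every individual trajectory then decays toward zero. Maximizing the expected logarithm of wealth instead is one way to align the objective with the single-trajectory growth rate in this particular gamble.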