Envisioned application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, which all require RL agents to make decisions in the real world. A significant challenge hindering the adoption of RL methods in these domains is the non-robustness of conventional algorithms. In this paper, we argue that a fundamental issue contributing to this lack of robustness lies in the focus on the expected value of the return as the sole "correct" optimization objective. The expected value is the average over the statistical ensemble of infinitely many trajectories. For non-ergodic returns, this average differs from the average over a single but infinitely long trajectory. Consequently, optimizing the expected value can lead to policies that yield exceptionally high returns with probability zero but almost surely result in catastrophic outcomes. This problem can be circumvented by transforming the time series of collected returns into one with ergodic increments. This transformation enables learning robust policies by optimizing the long-term return for individual agents rather than the average across infinitely many trajectories. We propose an algorithm for learning ergodicity transformations from data and demonstrate its effectiveness in an instructive, non-ergodic environment and on standard RL benchmarks.
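The gap between the ensemble average and the single-trajectory average can be illustrated with a small simulation. The sketch below assumes a multiplicative coin-toss game that is not taken from the text above: heads multiplies an agent's wealth by 1.5 and tails multiplies it by 0.6 (both factors, and all function names, are hypothetical choices for illustration). Under these assumptions the expected value grows every round, while the logarithm of wealth, whose increments are independent and identically distributed and therefore ergodic, shrinks on average along any single trajectory. The logarithm here only stands in as one example of a transformation with ergodic increments; it is not the learned transformation the abstract proposes.

```python
import math
import random

# Hypothetical multiplicative coin-toss game (an assumption for illustration):
# heads multiplies wealth by UP, tails multiplies it by DOWN.
UP, DOWN = 1.5, 0.6


def ensemble_growth_factor():
    """Per-round growth factor of the expected value (ensemble average)."""
    return 0.5 * UP + 0.5 * DOWN


def time_average_growth_rate():
    """Expected per-round increment of log-wealth.

    Log-wealth has i.i.d. increments, so this is also the growth rate
    observed along a single, very long trajectory.
    """
    return 0.5 * math.log(UP) + 0.5 * math.log(DOWN)


def simulate_log_wealth(num_rounds, rng):
    """Log-wealth of one agent after num_rounds tosses, starting at wealth 1."""
    log_wealth = 0.0
    for _ in range(num_rounds):
        log_wealth += math.log(UP if rng.random() < 0.5 else DOWN)
    return log_wealth


def fraction_below_start(num_agents, num_rounds, seed=0):
    """Fraction of independent agents that end below their starting wealth."""
    rng = random.Random(seed)
    finals = [simulate_log_wealth(num_rounds, rng) for _ in range(num_agents)]
    return sum(1 for lw in finals if lw < 0.0) / num_agents


if __name__ == "__main__":
    print("ensemble growth factor per round:", ensemble_growth_factor())
    print("time-average log growth per round:", time_average_growth_rate())
    print("fraction of agents below start:", fraction_below_start(500, 1000))
```

In this toy setting, an agent that maximizes the expected value would keep playing, even though nearly every individual agent ends up with less than it started with. Optimizing the growth of the transformed quantity instead matches what a single agent experiences over time.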