We study offline reinforcement learning (RL) in possibly nonstationary environments. Many existing RL algorithms rely on a stationarity assumption, which requires the system transition and reward functions to be constant over time. This assumption is restrictive in practice and is likely to be violated in applications such as traffic signal control, robotics, and mobile health. In this paper, we develop a consistent procedure to test for nonstationarity of the optimal Q-function using pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.
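To give a rough sense of what a change point test looks like, the sketch below computes a generic CUSUM-type statistic for a mean shift in a scalar sequence and calibrates it by permutation. It is an illustration only and not the procedure proposed in the paper: the paper tests the optimal Q-function, whereas this toy example works on raw scalar observations. The function names `cusum_statistic` and `permutation_pvalue` are hypothetical, and the permutation step assumes the observations are exchangeable under the null, which generally would not hold for dependent RL trajectories.

```python
import math
import random


def cusum_statistic(x):
    """Return (max weighted CUSUM statistic, best split index) for a sequence.

    For each candidate split u in 1..n-1, the mean of x[:u] is compared with
    the mean of x[u:], weighted by sqrt(u * (n - u) / n).
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(x)
    best_stat, best_u = 0.0, 1
    prefix = 0.0
    for u in range(1, n):
        prefix += x[u - 1]  # running sum of x[:u]
        left_mean = prefix / u
        right_mean = (total - prefix) / (n - u)
        stat = math.sqrt(u * (n - u) / n) * abs(left_mean - right_mean)
        if stat > best_stat:
            best_stat, best_u = stat, u
    return best_stat, best_u


def permutation_pvalue(x, n_perm=500, seed=0):
    """Permutation p-value for the max CUSUM statistic.

    Assumption (for illustration): observations are exchangeable when there
    is no change point, so shuffling time order mimics the null distribution.
    """
    rng = random.Random(seed)
    observed, _ = cusum_statistic(x)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat, _ = cusum_statistic(shuffled)
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

For a sequence such as `[0.0] * 60 + [1.0] * 40`, `cusum_statistic` returns split index 60, and the permutation p-value is small. In a sequential setting, one would rerun such a test as new batches of historical data arrive and, once a change is flagged, fit the policy only on data after the estimated change point.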