Offline reinforcement learning, that is, learning a policy from a batch of data, is known to be hard for general MDPs. This hardness motivates the study of specific classes of MDPs in which offline reinforcement learning might be feasible. In this work, we explore one such restricted class of MDPs to obtain guarantees for offline reinforcement learning. The key property, which we call Action Impact Regularity (AIR), is that actions primarily impact one part of the state (an endogenous component) and have limited impact on the remaining part (an exogenous component). AIR is a strong assumption, but it nonetheless holds in a number of real-world domains, including financial markets. We discuss algorithms that exploit the AIR property and provide a theoretical analysis of an algorithm based on Fitted Q-Iteration. Finally, we demonstrate that the algorithm outperforms existing offline reinforcement learning algorithms across different data collection policies in simulated and real-world environments where the regularity holds.
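To make the AIR structure concrete, the following is a minimal illustrative sketch, not the algorithm analyzed in this work. It assumes a hypothetical toy trading problem in which the exogenous component is a price that evolves independently of the agent's actions (the idealized case of zero action impact) and the endogenous component is an inventory whose dynamics are assumed to be known. All names (`exogenous_step`, `endogenous_step`, `log_prices`, `replay_return`) and the dynamics themselves are assumptions made for illustration. Under these assumptions, a logged price trace can be replayed to evaluate a policy different from the one that collected the data.

```python
import random


def exogenous_step(price, rng):
    """Exogenous component (assumed): the price moves without reading the action."""
    return max(1.0, price + rng.choice([-1.0, 1.0]))


def endogenous_step(inventory, action, max_inventory=5):
    """Endogenous component (assumed known): inventory is driven by the action."""
    return min(max(inventory + action, 0), max_inventory)


def log_prices(horizon, seed=0, start=10.0):
    """Collect one exogenous trace; no action appears anywhere in this function."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(horizon):
        prices.append(exogenous_step(prices[-1], rng))
    return prices


def replay_return(prices, policy):
    """Replay a logged price trace and simulate the endogenous part for `policy`.

    Valid only under the idealized assumption that actions have no impact
    on the exogenous component.
    """
    inventory, cash = 0, 0.0
    for t in range(len(prices) - 1):
        action = policy(prices[t], inventory)  # -1 sell, 0 hold, +1 buy
        new_inventory = endogenous_step(inventory, action)
        traded = new_inventory - inventory
        cash -= traded * prices[t]  # pay when buying, receive when selling
        inventory = new_inventory
    return cash + inventory * prices[-1]  # value remaining inventory at the last price


def hold(price, inventory):
    """Never trade."""
    return 0


def buy_once(price, inventory):
    """Buy a single unit at the first step, then hold."""
    return 1 if inventory == 0 else 0


prices = log_prices(horizon=20)
print(replay_return(prices, hold))
print(replay_return(prices, buy_once))
```

The same logged trace is reused for both policies, which is the point of the sketch: when actions do not move the exogenous component, counterfactual trajectories can be generated from data without further interaction. When the impact is small but nonzero, such replays are only approximate.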