The goal of an offline reinforcement learning (RL) algorithm is to learn optimal policies from historical (offline) data, without access to the environment for online exploration. One of the main challenges in offline RL is distribution shift, which refers to the difference between the state-action visitation distribution of the data-generating policy and that of the learning policy. Many recent works have used the idea of pessimism to develop offline RL algorithms and to characterize their sample complexity under the relatively weak assumption of single-policy concentrability. Separately from the offline RL literature, the area of distributionally robust learning (DRL) offers a principled framework that uses a minimax formulation to tackle model mismatch between training and testing environments. In this work, we aim to bridge these two areas by showing that the DRL approach can be used to tackle the distribution shift problem in offline RL. In particular, we propose two offline RL algorithms using the DRL framework, one for the tabular setting and one for the linear function approximation setting, and characterize their sample complexity under the single-policy concentrability assumption. We also demonstrate the superior performance of our proposed algorithms through simulation experiments.