We study offline reinforcement learning (RL) in partially observable Markov decision processes (POMDPs). In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy that possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which precludes the use of existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses both the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as a minimax estimation problem. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves an $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To the best of our knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
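As a schematic illustration of the pessimism principle described above (not the exact formulation used by \texttt{P3O}, and with illustrative notation $\Pi$, $\mathcal{C}(\mathcal{D})$, and $\widehat{V}$ that is not defined in this abstract), the learned policy can be viewed as maximizing the most pessimistic value estimate over a data-driven confidence region:
\begin{equation*}
	\widehat{\pi} \in \operatorname*{argmax}_{\pi \in \Pi} \; \min_{f \in \mathcal{C}(\mathcal{D})} \widehat{V}(\pi; f),
\end{equation*}
where $\Pi$ is the policy class, $\mathcal{C}(\mathcal{D})$ is a confidence region of candidate (bridge) value functions consistent with the confounded dataset $\mathcal{D}$, constructed via proximal causal inference through minimax estimation, and $\widehat{V}(\pi; f)$ is the value estimate of policy $\pi$ induced by the candidate $f$.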