In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result that allows any history-dependent policy gradient under POMDPs to be estimated non-parametrically from offline data. This identification result lets us estimate the policy gradient by solving a sequence of conditional moment restrictions via a min-max learning procedure with general function approximation. We then provide a finite-sample, non-asymptotic bound on the gradient estimation error, uniform over a pre-specified policy class, in terms of the sample size, the horizon length, the concentrability coefficient, and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by plugging the proposed gradient estimator into a gradient ascent algorithm, we show that the resulting algorithm converges globally to the optimal history-dependent policy under some technical conditions. To the best of our knowledge, this is the first work to study policy gradient methods for POMDPs in the offline setting.
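To make the estimation pipeline above concrete, the following is a generic sketch of a min-max estimator for a single conditional moment restriction, followed by the gradient ascent update. All symbols here ($W$, $Z$, $m$, $\mathcal{B}$, $\mathcal{F}$, $\lambda$, $\eta$, $\theta$) are illustrative assumptions and are not notation taken from the paper; the paper's actual restrictions form a sequence over the horizon and may be formulated differently.

```latex
% Assumed generic form: a conditional moment restriction in an unknown function b,
% with observed data W and conditioning variable Z (hypothetical notation).
\[
  \mathbb{E}\bigl[\, m(W; b) \mid Z \,\bigr] = 0 .
\]
% A common min-max estimator over function classes B (for b) and F (test functions),
% with regularization parameter lambda > 0 and samples (W_i, Z_i), i = 1, ..., n:
\[
  \hat{b} \in \arg\min_{b \in \mathcal{B}} \; \sup_{f \in \mathcal{F}}
  \left\{ \frac{1}{n} \sum_{i=1}^{n} m(W_i; b)\, f(Z_i)
          \;-\; \lambda \, \frac{1}{n} \sum_{i=1}^{n} f(Z_i)^2 \right\} .
\]
% Gradient ascent with step size eta, using an estimated policy gradient
% (hat nabla V) for a policy pi_theta parameterized by theta:
\[
  \theta^{(k+1)} = \theta^{(k)} + \eta \, \widehat{\nabla_\theta V}\bigl(\pi_{\theta^{(k)}}\bigr),
  \qquad k = 0, 1, 2, \dots
\]
```

In the population limit, and if $\mathcal{F}$ is rich enough to contain $\mathbb{E}[m(W; b) \mid Z] / (2\lambda)$, the inner supremum equals $\frac{1}{4\lambda}\,\mathbb{E}\bigl[(\mathbb{E}[m(W; b) \mid Z])^2\bigr]$, so the outer minimization drives the conditional moment toward zero.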