In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result that allows any history-dependent policy gradient under POMDPs to be estimated non-parametrically from offline data. This identification result enables us to estimate the policy gradient by solving a sequence of conditional moment restrictions using a min-max learning procedure with general function approximation. We then provide a finite-sample bound on the gradient estimation error, uniform over a pre-specified policy class, in terms of the sample size, the horizon length, the concentrability coefficient, and a measure of ill-posedness in solving the conditional moment restrictions. Lastly, by plugging the proposed gradient estimator into a gradient ascent algorithm, we show that the resulting algorithm converges globally to the optimal history-dependent policy under some technical conditions. To the best of our knowledge, this is the first work to study policy gradient methods for POMDPs in the offline setting.