We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators and fitted-Q evaluation suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs. Future-dependent value functions play a role similar to that of classical value functions in fully observable MDPs. We derive a new Bellman equation for future-dependent value functions in the form of conditional moment equations that use history proxies as instrumental variables. We further propose a minimax method for learning future-dependent value functions from this new Bellman equation. We obtain a PAC result, which implies that our OPE estimator is consistent as long as futures and histories contain sufficient information about latent states and Bellman completeness holds. Finally, we extend our methods to learning the dynamics and establish a connection between our approach and the well-known spectral learning methods in POMDPs.
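To make the moment-equation idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the future-dependent value function is linear, q(F) = phi(F)^T theta, that the instrument functions of the history are linear in features psi(H), and that an importance weight mu (ratio of evaluation to behavior policy probabilities) is available per sample. Under these assumptions the empirical moment equation E_n[psi(H) (mu (R + gamma q(F')) - q(F))] = 0 is a linear system in theta, which the sketch solves in the least-squares sense. The function name, the feature maps, and the ridge parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def linear_fdvf_estimate(phi_F, phi_F_next, psi_H, rewards, mu, gamma, ridge=0.0):
    """Fit theta for a linear future-dependent value function q(F) = phi(F)^T theta.

    Hypothetical helper (not from the paper). Solves, in the least-squares sense,
        E_n[ psi(H) * ( mu * (R + gamma * q(F')) - q(F) ) ] = 0.

    phi_F, phi_F_next : (n, d) features of the current and next future proxies
    psi_H             : (n, k) features of the history proxies (instruments)
    rewards, mu       : (n,) rewards and importance weights
    """
    n = phi_F.shape[0]
    # A = E_n[ psi(H) (phi(F) - gamma * mu * phi(F'))^T ], shape (k, d)
    A = psi_H.T @ (phi_F - gamma * mu[:, None] * phi_F_next) / n
    # b = E_n[ mu * R * psi(H) ], shape (k,)
    b = psi_H.T @ (mu * rewards) / n
    # Normal equations, with an optional ridge term for ill-conditioned systems.
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + ridge * np.eye(d), A.T @ b)


# Sanity check in a fully observable special case (an assumption made only
# for this check): two states on a deterministic cycle 0 -> 1 -> 0, reward 1
# when leaving state 0, on-policy data (mu = 1), and one-hot features used
# for both the "future" and the "history".
states = np.array([0, 1, 0, 1])
next_states = np.array([1, 0, 1, 0])
rewards = np.array([1.0, 0.0, 1.0, 0.0])
one_hot = np.eye(2)

theta = linear_fdvf_estimate(
    phi_F=one_hot[states],
    phi_F_next=one_hot[next_states],
    psi_H=one_hot[states],
    rewards=rewards,
    mu=np.ones(4),
    gamma=0.5,
)
print(theta)
```

In this special case the estimate coincides with the classical value function, which satisfies V(0) = 1 + 0.5 V(1) and V(1) = 0.5 V(0). When the linear system is exactly solvable, its solution also attains a zero value of the minimax objective restricted to linear function classes; richer function classes would require an actual saddle-point optimization.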