In off-policy evaluation (OPE) for partially observable Markov decision processes (POMDPs), an agent must infer hidden states from past observations, which exacerbates both the curse of horizon and the curse of memory in existing OPE methods. This paper introduces a novel covering-analysis framework that exploits the intrinsic metric structure of the belief space (the space of distributions over latent states) to relax traditional coverage assumptions. By assuming that value-relevant functions are Lipschitz continuous over the belief space, we derive error bounds that mitigate exponential blow-ups in horizon and memory length. Our unified analysis applies to a broad class of OPE algorithms, yielding concrete error bounds and coverage requirements expressed in terms of belief-space metrics rather than raw history coverage. We illustrate the improved sample efficiency afforded by this framework through two case studies: the double-sampling Bellman error minimization algorithm and memory-based future-dependent value functions (FDVFs). In both cases, our coverage definition based on the belief-space metric yields tighter bounds.
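As a rough illustration of the key assumption (the notation $b(h)$, $W_1$, $L$, $\mathcal{N}_\epsilon$, and $C_{\mathrm{bel}}(\epsilon)$ below is assumed for this sketch and is not taken from the paper), the Lipschitz condition and a belief-space coverage coefficient might be formalized as follows:
\[
  \bigl| f\bigl(b(h)\bigr) - f\bigl(b(h')\bigr) \bigr| \;\le\; L \, W_1\!\bigl(b(h),\, b(h')\bigr)
  \quad \text{for all histories } h, h',
\]
where $b(h)$ denotes the belief (distribution over latent states) induced by history $h$ and $W_1$ is the 1-Wasserstein distance on the belief space. Under such a condition, a covering-number argument lets coverage be measured at the resolution of the metric, e.g.
\[
  C_{\mathrm{bel}}(\epsilon) \;=\; \max_{B \in \mathcal{N}_\epsilon} \frac{d^{\pi_e}(B)}{d^{\pi_b}(B)},
\]
where $\mathcal{N}_\epsilon$ is an $\epsilon$-cover of the belief space and $d^{\pi_e}$, $d^{\pi_b}$ are occupancy measures of the evaluation and behavior policies, rather than requiring a density ratio over all raw length-$H$ histories.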