We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantees avoid exponential dependence on the horizon. While such estimators exist for MDPs, and POMDPs can be converted to history-based MDPs, the resulting estimation errors depend on the history density ratio (the analogue of the MDP state-density ratio after conversion), which is an exponentially large object. Recently, Uehara et al. [2022a] proposed future-dependent value functions as a promising framework for addressing this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, the guarantee also depends on the boundedness of the future-dependent value function and other related quantities, which we show can be exponential in the horizon, erasing the advantage of the method. In this paper, we identify novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage, which enable polynomial bounds on the aforementioned quantities. As a byproduct, our analyses also lead to the discovery of new algorithms with complementary properties.