We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies, and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.
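The abstract names the estimator but does not spell out its form. As a minimal sketch of one natural reading of "partial history importance weighting", the code below reweights each reward by the product of the target-to-behavior action probability ratios over only the most recent `k` steps, rather than over the whole trajectory. The function name `partial_history_iw`, the window length `k`, the array layout, and the plain (not self-normalized) average are all assumptions made for illustration, not the paper's definitions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def partial_history_iw(rewards, target_probs, behavior_probs, k):
    """Hypothetical sketch of an importance-weighted estimate using a k-step history.

    rewards, target_probs, behavior_probs: arrays of shape (n_trajectories, T).
    target_probs[i, t] / behavior_probs[i, t] is the probability that the
    target / behavior policy assigns to the action actually taken at step t.
    k: number of most recent steps whose ratios enter each weight (assumed knob).
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    n, T = rewards.shape
    if not 1 <= k <= T:
        raise ValueError("k must satisfy 1 <= k <= T")

    # windows[i, j] holds ratios[i, j : j + k]; shape (n, T - k + 1, k).
    windows = sliding_window_view(ratios, k, axis=1)
    # Weight for the reward at step j + k - 1 is the product over its last k ratios.
    weights = windows.prod(axis=-1)

    # Average the weighted rewards over all trajectories and all usable steps.
    return float(np.mean(weights * rewards[:, k - 1:]))
```

Under this reading, `k = 1` reduces to per-step importance weighting and `k = T` weights the final reward by the full-trajectory ratio. A larger `k` would be expected to reduce bias from ignoring older history at the cost of higher variance when overlap is weak, which is consistent with the abstract's statement that the rate depends on both overlap and mixing time.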