We study off-policy evaluation of dynamic treatment rules under sequential ignorability, assuming that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it consistently estimates the stationary mean reward of a target policy given sufficiently long trajectories drawn from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap between the target and behavior policies and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax optimal given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.
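The abstract names the estimator but does not state its formula. As a rough illustration only, the sketch below assumes one natural reading of "partial history importance weighting": each reward is weighted by the product of target-to-behavior probability ratios over the most recent k + 1 time steps, and the weighted rewards are averaged over all trajectories and eligible time steps. The function name `partial_history_iw`, the array layout, and the normalization are assumptions made for this example, not the paper's definitions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def partial_history_iw(rewards, ratios, k):
    """Hypothetical sketch of a partial-history importance-weighted mean.

    rewards: array of shape (n_trajectories, T), observed rewards.
    ratios:  array of shape (n_trajectories, T), where ratios[i, t] is
             pi_target(action | observation) / pi_behavior(action | observation)
             at step t of trajectory i (assumed to be supplied by the caller).
    k:       number of past steps included in each weight, 0 <= k < T.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    if rewards.shape != ratios.shape or rewards.ndim != 2:
        raise ValueError("rewards and ratios must share a 2-D shape (n, T)")
    n_steps = rewards.shape[1]
    if not 0 <= k < n_steps:
        raise ValueError("k must satisfy 0 <= k < T")

    # windows[i, j, :] holds ratios[i, j : j + k + 1]; its product is the
    # weight attached to the reward at time step j + k.
    windows = sliding_window_view(ratios, k + 1, axis=1)
    weights = windows.prod(axis=-1)  # shape (n, T - k)

    # Average the weighted rewards over trajectories and eligible steps.
    return float(np.mean(weights * rewards[:, k:]))


# Toy usage with made-up numbers (not data from the paper).
toy_rewards = [[1.0, 2.0, 3.0, 4.0]]
toy_ratios = [[1.0, 2.0, 1.0, 0.5]]
print(partial_history_iw(toy_rewards, toy_ratios, k=1))
```

Under this reading, k = 0 reduces to per-step importance weighting, while larger k conditions on more history at the cost of higher-variance weights. That trade-off is consistent with the abstract's statement that the error exponent depends on both overlap and mixing time, though how k should be chosen is not specified here.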