Doubly robust methods hold considerable promise for off-policy evaluation in Markov decision processes (MDPs) under sequential ignorability: they have been shown to converge at a $1/\sqrt{T}$ rate in the horizon $T$, to be statistically efficient in large samples, and to allow for modular implementation where preliminary estimation tasks can be executed using standard reinforcement learning techniques. Existing results, however, make heavy use of a strong distributional overlap assumption whereby the stationary distributions of the target policy and the data-collection policy are within a bounded factor of each other, and this assumption is typically only credible when the state space of the MDP is bounded. In this paper, we revisit the task of off-policy evaluation in MDPs under a weaker notion of distributional overlap, and introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting. When the distribution ratio of the target and data-collection policies is square-integrable (but not necessarily bounded), our approach recovers the large-sample behavior previously established under strong distributional overlap. When this ratio is not square-integrable, TDR is still consistent but converges at a slower-than-$1/\sqrt{T}$ rate; furthermore, this rate of convergence is minimax over a class of MDPs defined only using mixing conditions. We validate our approach numerically and find that, in our experiments, appropriate truncation plays a major role in enabling accurate off-policy evaluation when strong distributional overlap does not hold.
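To make the role of truncation concrete, the following is a minimal sketch of the general idea only, not the estimator defined in the paper. It assumes a generic doubly robust form in which a plug-in (direct) value estimate is corrected by an average of weighted residuals, with the weights capped at a threshold before use. The function name `truncated_dr_estimate`, its arguments, and the threshold `tau` are all hypothetical names introduced for illustration; how the weights, residuals, and threshold would actually be constructed is not specified in the abstract.

```python
from typing import Sequence


def truncated_dr_estimate(
    direct_estimate: float,
    weights: Sequence[float],
    residuals: Sequence[float],
    tau: float,
) -> float:
    """Plug-in estimate plus a weighted residual correction with capped weights.

    Hypothetical illustration: `weights` stands in for estimated distribution
    ratios between target and data-collection policies, `residuals` for
    per-step errors of the plug-in model, and `tau` for a truncation level.
    """
    if len(weights) != len(residuals):
        raise ValueError("weights and residuals must have the same length")
    if len(weights) == 0:
        raise ValueError("at least one observation is required")
    if tau <= 0:
        raise ValueError("tau must be positive")

    # Cap each weight at tau so that a few very large ratios cannot dominate.
    clipped = [min(w, tau) for w in weights]

    # Average the weighted residuals and add them to the plug-in estimate.
    correction = sum(c * r for c, r in zip(clipped, residuals)) / len(clipped)
    return direct_estimate + correction


if __name__ == "__main__":
    # One extreme weight (100.0) is capped at tau = 2.0 before averaging.
    print(truncated_dr_estimate(0.0, [100.0, 1.0], [1.0, 1.0], tau=2.0))
```

Capping the weights trades a small bias for a potentially large reduction in variance when the ratio is heavy-tailed, which is the regime the abstract describes as weak distributional overlap. With a very large `tau` the sketch reduces to the untruncated weighted correction.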