Off-policy evaluation (OPE) estimates the return of a target policy from pre-collected observational data generated by a potentially different behavior policy. In some cases, unmeasured variables may confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). As in single-stage decision making, we show that an IV enables us to correctly identify the target policy's value in infinite-horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and an analysis of real data from a world-leading short-video platform.
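To make the single-stage intuition concrete, the following is a minimal sketch of how an instrument can remove confounding bias in a one-step decision problem. It is not the paper's estimator for confounded MDPs. It uses the classical Wald (ratio) IV estimator on synthetic data, and every name and constant in it (`simulate`, `naive_effect`, `wald_effect`, the coefficients, the true effect of 1.0) is an assumption made purely for illustration.

```python
import numpy as np


def simulate(n, seed=0):
    """Draw a synthetic single-stage dataset (all constants are illustrative)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)          # unmeasured confounder
    z = rng.integers(0, 2, size=n)  # binary instrument, independent of u
    # The action depends on both the instrument and the confounder.
    a = (0.8 * z + 0.8 * u + rng.normal(size=n) > 0.4).astype(float)
    # The reward depends on the action (assumed true effect 1.0) and on u.
    r = 1.0 * a + 2.0 * u + rng.normal(size=n)
    return z, a, r


def naive_effect(a, r):
    """Difference in mean reward between actions; biased when u is unobserved."""
    return r[a == 1].mean() - r[a == 0].mean()


def wald_effect(z, a, r):
    """Wald IV estimate: instrument's effect on reward over its effect on action."""
    reward_shift = r[z == 1].mean() - r[z == 0].mean()
    action_shift = a[z == 1].mean() - a[z == 0].mean()
    return reward_shift / action_shift


z, a, r = simulate(200_000)
print("naive estimate:", naive_effect(a, r))
print("IV (Wald) estimate:", wald_effect(z, a, r))
```

In this simulation the naive comparison overstates the action's effect because the confounder raises both the chance of taking the action and the reward, while the ratio estimator stays close to the assumed true effect. The sketch relies on the instrument being independent of the confounder and affecting the reward only through the action, and on a constant action effect; the infinite-horizon setting treated in the paper requires additional structure that is not shown here.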