Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policies, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model, using a strategy based on partial identification under non-parametric assumptions on the conditional mean function, with a particular focus on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values provide upper and lower bounds on the contribution of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed-form solution that can be computed efficiently and that, under the Lipschitz assumption, their solutions converge to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal up to log factors. We deploy our methods on two semi-synthetic examples and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.
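To make the role of the Lipschitz assumption concrete, the following is a minimal sketch of how smoothness alone can bound a conditional mean at points where no data are available. It uses the standard Lipschitz-extension envelope and is not necessarily the closed-form solution derived in this work. The function name `lipschitz_bounds`, the Euclidean metric, the optional outcome range `y_min`/`y_max`, and the treatment of the conditional means at observed points as known (in practice they would be estimated) are all assumptions made for illustration.

```python
import numpy as np


def lipschitz_bounds(x_query, x_obs, mu_obs, lipschitz_const,
                     y_min=-np.inf, y_max=np.inf):
    """Pointwise bounds on an L-Lipschitz function at unobserved points.

    Hypothetical helper for illustration only.
    x_query: (m, d) points in the no-overlap region.
    x_obs:   (n, d) points where the conditional mean is assumed known.
    mu_obs:  (n,)   conditional mean values at x_obs.
    """
    # Pairwise Euclidean distances, shape (m, n).
    dists = np.linalg.norm(x_query[:, None, :] - x_obs[None, :, :], axis=-1)
    # Every observed point caps the function from above and from below;
    # the tightest cap over all observed points gives the envelope.
    upper = np.min(mu_obs[None, :] + lipschitz_const * dists, axis=1)
    lower = np.max(mu_obs[None, :] - lipschitz_const * dists, axis=1)
    # Optionally intersect with a known outcome range.
    return np.clip(lower, y_min, y_max), np.clip(upper, y_min, y_max)


if __name__ == "__main__":
    x_obs = np.array([[0.0], [1.0]])
    mu_obs = np.array([0.0, 1.0])
    x_query = np.array([[0.5], [2.0]])
    lower, upper = lipschitz_bounds(x_query, x_obs, mu_obs, lipschitz_const=2.0)
    # Averaging the pointwise bounds under the target policy's weights would
    # bound the no-overlap contribution; uniform weights are assumed here.
    print(lower.mean(), upper.mean())
```

The interval widens with the distance from the nearest observed point, which is why a smaller Lipschitz constant or denser coverage yields tighter bounds on the no-overlap contribution.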