Off-Policy Estimation (OPE) methods allow us to learn and evaluate decision-making policies from logged data. This makes them an attractive choice for the offline evaluation of recommender systems, and several recent works have reported successful adoption of OPE methods to this end. An important assumption underlying these methods is the absence of unobserved confounders: random variables that influence both actions and rewards at data collection time. Because the data collection policy is typically under the practitioner's control, the unconfoundedness assumption is often left implicit, and its violations are rarely dealt with in the existing literature. This work aims to highlight the problems that arise when performing off-policy estimation in the presence of unobserved confounders, specifically focusing on a recommendation use case. We focus on policy-based estimators, where the logging propensities are learned from logged data. We characterise the statistical bias that arises due to confounding, and show how existing diagnostics are unable to uncover such cases. Because the bias depends directly on the true and unobserved logging propensities, it is non-identifiable. As the unconfoundedness assumption is famously untestable, this becomes especially problematic. This paper emphasises this common, yet often overlooked issue. Through synthetic data, we empirically show how naïve propensity estimation under confounding can lead to severely biased metric estimates that are allowed to fly under the radar. We aim to cultivate an awareness among researchers and practitioners of this important problem, and touch upon potential research directions towards mitigating its effects.
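The effect described above can be illustrated with a minimal synthetic sketch. It is not the paper's experimental setup: the binary confounder, the two-action setting, all probabilities, the choice of inverse propensity scoring (IPS) as the policy-based estimator, and the function name `simulate` are assumptions made purely for illustration. The sketch compares IPS with the true, confounder-dependent propensities against IPS with a naïve propensity estimated from the logged actions alone, and also reports the mean importance weight, a common sanity check.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Toy confounded logging setup (all numbers are illustrative assumptions)."""
    rng = np.random.default_rng(seed)

    # Unobserved binary confounder, e.g. a hidden user state.
    u = rng.integers(0, 2, size=n)

    # True logging propensity of action 1 depends on the confounder.
    p_log = np.where(u == 1, 0.9, 0.1)
    a = (rng.random(n) < p_log).astype(int)

    # Reward probability depends on both the action and the confounder.
    p_reward = np.where(a == 1, np.where(u == 1, 0.8, 0.2), 0.5)
    r = (rng.random(n) < p_reward).astype(float)

    # Target policy: always play action 1.
    # Its true value is E_u[P(r=1 | a=1, u)] = 0.5 * 0.8 + 0.5 * 0.2.
    true_value = 0.5 * 0.8 + 0.5 * 0.2
    target_prob = (a == 1).astype(float)

    # IPS with the true propensities (unavailable in practice, since u is unobserved).
    ips_true = np.mean(target_prob * r / p_log)

    # Naive propensity: marginal frequency of action 1 in the log, ignoring u.
    naive_prop = a.mean()
    w_naive = target_prob / naive_prop
    ips_naive = np.mean(w_naive * r)

    return {
        "true_value": true_value,
        "ips_true": ips_true,
        "ips_naive": ips_naive,
        "mean_naive_weight": w_naive.mean(),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers, the naïve estimate concentrates around 0.74 in expectation, against a true value of 0.5, while the naïve importance weights still average to one. The weight-based check therefore gives no hint that anything is wrong, which matches the abstract's point that such bias can go undetected.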