Off-policy evaluation (OPE) is a central problem in robust decision-making: assessing the performance of a new policy using data collected under a different policy. Existing OPE methods, however, are limited both by statistical uncertainty and by causal considerations. In this thesis, we address these limitations through three works. First, we consider the high variance of importance-sampling-based OPE estimators. We introduce the Marginal Ratio (MR) estimator, a novel OPE method for contextual bandits that reduces variance by focusing on the shift in the marginal distribution of outcomes rather than on the policy shift directly. Second, we propose Conformal Off-Policy Prediction (COPP), a principled approach to uncertainty quantification in OPE that provides finite-sample predictive intervals, supporting robust decision-making in risk-sensitive applications. Finally, we address causal unidentifiability in off-policy decision-making by developing novel bounds for sequential decision settings that remain valid under arbitrary unmeasured confounding. We apply these bounds to assess the reliability of digital twin models, introducing a falsification framework that identifies scenarios in which model predictions diverge from real-world behaviour. Together, these contributions provide new insights into robust decision-making under uncertainty and establish principled methods for evaluating policies in both static and dynamic settings.