We consider off-policy evaluation (OPE) in contextual bandits with a finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiasedness, but it suffers from high variance when the action space is large or when parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings are not learned to minimize the mean squared error (MSE) of the estimator and do not incorporate context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data so as to minimize the MSE of the MIPS estimator. Building on a theoretical analysis of the bias and variance of MIPS, we derive an MSE-minimizing objective for CAEL-MIPS. In empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.
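For concreteness, a minimal sketch of the two estimators contrasted above, in the standard notation of the MIPS literature; the symbols below (target policy $\pi$, logging policy $\pi_0$, logged data $(x_i, a_i, e_i, r_i)$, and the embedding density $p(e \mid x, a)$) are assumptions not defined in the abstract itself:
\[
  \widehat{V}_{\mathrm{IPS}}(\pi)
    = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
  \qquad
  \widehat{V}_{\mathrm{MIPS}}(\pi)
    = \frac{1}{n} \sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i,
\]
where $p(e \mid x, \pi) = \sum_{a} \pi(a \mid x)\, p(e \mid x, a)$ marginalizes the embedding density over actions. Replacing the per-action importance weight with this marginal weight is what curbs the variance when many actions map to similar embeddings.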