Off-policy evaluation (OPE) methods estimate the expected reward of a policy from logged data collected by a different policy. OPE is a viable alternative to running expensive online A/B tests: it can speed up the development of new policies and reduce the risk of exposing customers to suboptimal treatments. However, when the number of actions is large, or when certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have high or even infinite variance. Saito and Joachims (arXiv:2202.06317v2 [cs.LG]) propose marginalized IPS (MIPS), which computes importance weights over action embeddings rather than over the actions themselves and thereby reduces the variance of IPS in large action spaces. MIPS assumes that the practitioner can define good action embeddings, which is difficult in many real-world applications. In this work, we explore learning action embeddings from logged data. In particular, we use intermediate outputs of a trained reward model to define action embeddings for MIPS. This approach extends MIPS to more applications, and in our experiments it improves upon MIPS with pre-defined embeddings, as well as standard baselines, on both synthetic and real-world data. Our method makes no assumptions about the reward model class and supports using additional action information to further improve the estimates. The proposed approach presents an appealing alternative to the doubly robust (DR) estimator for combining the low variance of the direct method (DM) with the low bias of IPS.
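To make the contrast between IPS and MIPS concrete, the following is a minimal sketch on a toy problem, not the method proposed in this work. It assumes a discrete embedding (each action belongs to one of a few clusters), context-independent policies, an embedding distribution p(e | a) that does not depend on the context, and rewards that depend on the action only through its embedding. All names (`ips_estimate`, `mips_estimate`, `cluster_of`, and so on) are hypothetical. The embeddings here are fixed by hand; learning them from a reward model, as explored in this work, is not shown.

```python
import numpy as np


def ips_estimate(pi_target, pi_logging, actions, rewards):
    """Vanilla IPS: mean of rewards weighted by pi(a|x) / pi0(a|x)."""
    idx = np.arange(len(actions))
    weights = pi_target[idx, actions] / pi_logging[idx, actions]
    return float(np.mean(weights * rewards))


def mips_estimate(pi_target, pi_logging, p_e_given_a, embeddings, rewards):
    """MIPS: mean of rewards weighted by p(e|x,pi) / p(e|x,pi0).

    p_e_given_a has shape (n_actions, n_embeddings); row a holds p(e | a).
    Assumption: p(e | a) does not depend on the context x.
    """
    # Marginal embedding distributions, shape (n_samples, n_embeddings).
    marg_target = pi_target @ p_e_given_a
    marg_logging = pi_logging @ p_e_given_a
    idx = np.arange(len(embeddings))
    weights = marg_target[idx, embeddings] / marg_logging[idx, embeddings]
    return float(np.mean(weights * rewards))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_samples, n_actions, n_embeddings = 5000, 200, 5

# Each action maps deterministically to one cluster, so p(e | a) is one-hot.
cluster_of = rng.integers(n_embeddings, size=n_actions)
p_e_given_a = np.eye(n_embeddings)[cluster_of]

# Expected reward depends on the action only through its cluster.
mean_reward = np.linspace(0.1, 0.9, n_embeddings)

# Context-independent logging and target policies, repeated for every sample.
pi0 = softmax(rng.normal(size=n_actions))
pi = softmax(2.0 * rng.normal(size=n_actions))
pi_logging = np.tile(pi0, (n_samples, 1))
pi_target = np.tile(pi, (n_samples, 1))

# Logged data collected under the logging policy.
actions = rng.choice(n_actions, size=n_samples, p=pi0)
embeddings = cluster_of[actions]
rewards = rng.binomial(1, mean_reward[embeddings])

true_value = float(np.sum(pi * mean_reward[cluster_of]))
ips_value = ips_estimate(pi_target, pi_logging, actions, rewards)
mips_value = mips_estimate(pi_target, pi_logging, p_e_given_a, embeddings, rewards)

print(f"true value: {true_value:.3f}")
print(f"IPS:        {ips_value:.3f}")
print(f"MIPS:       {mips_value:.3f}")
```

In this sketch the MIPS weights take at most `n_embeddings` distinct values, whereas the IPS weights can take up to `n_actions` distinct values and can be much larger for actions that the logging policy rarely selects. Rerunning with different seeds gives a rough sense of how much each estimate varies.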