We study off-policy evaluation (OPE) in contextual bandit settings with large action spaces. Benchmark estimators face a severe bias-variance tradeoff: parametric approaches are biased because the correct model is difficult to specify, whereas importance-weighting approaches suffer from high variance. To overcome these limitations, Marginalized Inverse Propensity Scoring (MIPS) was proposed to reduce the estimator's variance by exploiting action embeddings. To make the estimator more accurate, we propose a doubly robust extension of MIPS, the Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that the proposed estimator is unbiased under weaker assumptions than MIPS while retaining the main advantage of MIPS, namely variance reduction relative to Inverse Propensity Scoring (IPS). Empirical experiments verify the superiority of MDR over existing estimators.
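To make the three estimators named in the abstract concrete, the following is a minimal sketch on a toy problem with discrete contexts, actions, and embeddings. The IPS and MIPS functions follow their standard definitions (importance weights over actions and over marginal embedding distributions, respectively). The abstract does not state the MDR formula, so `mdr_sketch` is only an assumption: it applies the usual doubly robust pattern (model baseline plus weighted residual) with MIPS weights and a hypothetical embedding-level reward model `q_hat`. All names, the toy data, and the simplification that the embedding distribution depends only on the action are illustrative, not taken from the paper.

```python
def embedding_marginal(action_probs, p_e_given_a):
    """p(e | x, pi) = sum_a pi(a | x) * p(e | a) for one context x."""
    n_embeddings = len(p_e_given_a[0])
    return [
        sum(action_probs[a] * p_e_given_a[a][e] for a in range(len(action_probs)))
        for e in range(n_embeddings)
    ]


def ips(logs, pi, pi0):
    """Inverse Propensity Scoring: weight rewards by pi(a|x) / pi0(a|x)."""
    total = sum(pi[x][a] / pi0[x][a] * r for x, a, e, r in logs)
    return total / len(logs)


def mips(logs, pi, pi0, p_e_given_a):
    """Marginalized IPS: weight rewards by p(e|x,pi) / p(e|x,pi0)."""
    total = 0.0
    for x, a, e, r in logs:
        w = (embedding_marginal(pi[x], p_e_given_a)[e]
             / embedding_marginal(pi0[x], p_e_given_a)[e])
        total += w * r
    return total / len(logs)


def mdr_sketch(logs, pi, pi0, p_e_given_a, q_hat):
    """Assumed doubly robust variant of MIPS (not the paper's stated formula).

    q_hat[x][e] is a hypothetical estimate of the expected reward given
    context x and embedding e.
    """
    total = 0.0
    for x, a, e, r in logs:
        pe_pi = embedding_marginal(pi[x], p_e_given_a)
        pe_pi0 = embedding_marginal(pi0[x], p_e_given_a)
        w = pe_pi[e] / pe_pi0[e]
        # Model-based baseline: expected q_hat under the target policy's embeddings.
        baseline = sum(pe_pi[k] * q_hat[x][k] for k in range(len(pe_pi)))
        # Correct the baseline with the importance-weighted residual.
        total += baseline + w * (r - q_hat[x][e])
    return total / len(logs)


# Toy data (illustrative): one context, three actions, two embeddings.
# Actions 0 and 1 share embedding 0; action 2 maps to embedding 1.
P_E_GIVEN_A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
PI0 = [[0.5, 0.25, 0.25]]   # logging policy pi0(a | x)
PI = [[0.1, 0.1, 0.8]]      # target policy pi(a | x)
# Each log entry is (context, action, embedding, reward).
LOGS = [(0, 0, 0, 1.0), (0, 1, 0, 1.0), (0, 2, 1, 0.0), (0, 0, 0, 1.0)]
Q_HAT = [[1.0, 0.0]]        # hypothetical reward model over embeddings

if __name__ == "__main__":
    print("IPS :", ips(LOGS, PI, PI0))
    print("MIPS:", mips(LOGS, PI, PI0, P_E_GIVEN_A))
    print("MDR :", mdr_sketch(LOGS, PI, PI0, P_E_GIVEN_A, Q_HAT))
```

In this sketch, setting `q_hat` to zero everywhere reduces `mdr_sketch` to `mips`, and a perfectly accurate `q_hat` makes every residual vanish, which is the usual intuition behind doubly robust constructions.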