We study Off-Policy Evaluation (OPE) in contextual bandit settings with large action spaces, where the benchmark estimators face a severe bias-variance tradeoff. Parametric approaches are biased when the reward model is misspecified, whereas importance-weighting approaches suffer from high variance. To mitigate the variance, the Marginalized Inverse Propensity Scoring (MIPS) estimator was proposed, which replaces the action-level importance weight with a marginal weight defined over action embeddings. However, MIPS is unbiased only under the no-direct-effect assumption, which requires the action embedding to fully mediate the effect of an action on the reward. To remove the dependence on this often unrealistic assumption, we propose the Marginalized Doubly Robust (MDR) estimator. Theoretical analysis shows that MDR is unbiased under assumptions weaker than those of MIPS while achieving lower variance than MIPS. Empirical experiments verify the superiority of MDR over existing estimators in large action spaces.
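For reference, a minimal sketch of the estimator forms in play, assuming logged data $\mathcal{D} = \{(x_i, a_i, e_i, r_i)\}_{i=1}^n$ collected under a logging policy $\pi_0$, a target policy $\pi$, action embeddings $e$, and a reward regression model $\hat{q}$; the MDR expression below is an illustrative doubly robust combination of the MIPS marginal weight with a model baseline, not the formal definition:
\[
\hat{V}_{\mathrm{IPS}}(\pi;\mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
\qquad
\hat{V}_{\mathrm{MIPS}}(\pi;\mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n} \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)}\, r_i,
\]
\[
\hat{V}_{\mathrm{MDR}}(\pi;\mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n} \left[ \frac{p(e_i \mid x_i, \pi)}{p(e_i \mid x_i, \pi_0)} \bigl(r_i - \hat{q}(x_i, e_i)\bigr) + \mathbb{E}_{\pi(a \mid x_i)\, p(e \mid x_i, a)}\bigl[\hat{q}(x_i, e)\bigr] \right].
\]
Here the MIPS weight marginalizes the importance ratio over the embedding distribution, which is what reduces variance in large action spaces; the model baseline $\hat{q}$ is what allows unbiasedness to survive violations of the no-direct-effect assumption in the doubly robust form.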