Ranking interfaces are everywhere in online platforms, and interest in their Off-Policy Evaluation (OPE), which aims to accurately evaluate the performance of ranking policies using only logged data, keeps growing. The de facto approach to OPE is Inverse Propensity Scoring (IPS), which provides an unbiased and consistent value estimate. In the ranking setup, however, IPS becomes extremely inaccurate because its variance is high under large action spaces. To deal with this problem, previous studies assume either independent or cascade user behavior, yielding ranking-specific versions of IPS. While these estimators reduce the variance to some extent, every existing estimator applies a single universal assumption to all users, which causes excessive bias and variance. This work therefore explores a far more general formulation in which user behavior is diverse and can vary with the user context. We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior. Moreover, AIPS achieves the minimum variance among all unbiased estimators based on IPS. We further develop a data-driven procedure that identifies the appropriate user behavior model to minimize the mean squared error (MSE) of AIPS. Extensive experiments demonstrate that the empirical accuracy improvement can be significant, enabling effective OPE of ranking systems even under diverse user behavior.
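To make the contrast between the existing estimators concrete, the following is a minimal sketch, not the paper's implementation, of how the importance weights differ under the standard (ranking-level), independent, and cascade assumptions. It assumes, purely for illustration, that both policies factorize over positions (so duplicate items in a ranking are allowed) and that they do not depend on the user context. The function names, array shapes, and the simulated data are all hypothetical. AIPS itself is not implemented here; per the abstract, it would let the behavior assumption vary with the user context.

```python
import numpy as np


def position_weights(pi_e, pi_0, rankings):
    """Per-position ratios pi_e[k, a_k] / pi_0[k, a_k], shape (n, K).

    pi_e, pi_0: (K, n_items) position-wise item probabilities (assumed factorized).
    rankings:   (n, K) integer item indices logged under pi_0.
    """
    k_idx = np.arange(rankings.shape[1])
    return pi_e[k_idx, rankings] / pi_0[k_idx, rankings]


def ranking_ips_estimates(pi_e, pi_0, rankings, rewards):
    """Value estimates under three user behavior assumptions.

    rewards: (n, K) position-wise rewards (e.g. clicks).
    """
    w = position_weights(pi_e, pi_0, rankings)   # independent: weight of a_k only
    w_full = w.prod(axis=1, keepdims=True)       # standard: weight of the whole ranking
    w_prefix = np.cumprod(w, axis=1)             # cascade: weight of the prefix a_{1:k}
    return {
        "standard": float(np.mean((w_full * rewards).sum(axis=1))),
        "independent": float(np.mean((w * rewards).sum(axis=1))),
        "cascade": float(np.mean((w_prefix * rewards).sum(axis=1))),
    }


# Hypothetical simulated log: uniform logging policy, independent user behavior.
rng = np.random.default_rng(0)
n, K, n_items = 10_000, 3, 5
pi_0 = np.full((K, n_items), 1.0 / n_items)
pi_e = rng.dirichlet(np.ones(n_items), size=K)
rankings = rng.integers(n_items, size=(n, K))
item_value = rng.uniform(size=n_items)
rewards = rng.binomial(1, item_value[rankings]).astype(float)

estimates = ranking_ips_estimates(pi_e, pi_0, rankings, rewards)
true_value = float((pi_e * item_value).sum())
print(estimates, true_value)
```

In this simulated setting the reward at each position depends only on the item shown there, so all three estimators target the same value; the ranking-level weight multiplies more ratios together and therefore tends to fluctuate more. When real user behavior violates the assumed model, the independent and cascade variants can become biased, which is the trade-off the abstract describes.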