We study off-policy evaluation (OPE) in slate contextual bandits, where a policy selects multi-dimensional actions known as slates. This problem is widespread in recommender systems, search engines, marketing, and medical applications. However, the typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces, making effective OPE a significant challenge. The PseudoInverse (PI) estimator was introduced to mitigate this variance by assuming that the reward function is linear, but it can incur significant bias because this assumption is hard to verify from observed data and is often substantially violated. To address the limitations of previous estimators, we develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space, where the slate abstractions are optimized in a data-driven way to minimize the bias and variance of LIPS. By doing so, LIPS can substantially reduce the variance of IPS without imposing restrictive assumptions, such as linearity, on the structure of the reward function. Through empirical evaluation, we demonstrate that LIPS substantially outperforms existing estimators, particularly in scenarios with non-linear rewards and large slate spaces.
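The contrast between slate-level and abstraction-level importance weights can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the setting is context-free, the abstraction `phi` is fixed by hand (the number of slots showing item 0) rather than optimized from data as LIPS does, and all function and variable names are hypothetical. The probability a policy assigns to an abstraction is taken to be the summed probability of the slates mapped to it.

```python
import itertools
import numpy as np


def slate_weights(pi_e, pi_0):
    """Slate-level importance weight pi_e(s) / pi_0(s) for every slate s."""
    return pi_e / pi_0


def latent_weights(pi_e, pi_0, phi):
    """Abstraction-level weight pi_e(phi(s)) / pi_0(phi(s)) for every slate s.

    Assumption: a policy's probability of an abstraction z is the total
    probability of all slates s with phi(s) == z.
    """
    n_latent = int(phi.max()) + 1
    pe_z = np.bincount(phi, weights=pi_e, minlength=n_latent)
    p0_z = np.bincount(phi, weights=pi_0, minlength=n_latent)
    return (pe_z / p0_z)[phi]


def weighted_estimate(weights, logged_slates, rewards):
    """Average of weight * reward over the logged data."""
    return float(np.mean(weights[logged_slates] * rewards))


# Toy problem: 3 slots, 4 candidate items per slot, hence 4**3 = 64 slates.
rng = np.random.default_rng(0)
slates = np.array(list(itertools.product(range(4), repeat=3)))

# Hand-picked abstraction (hypothetical): how many slots show item 0.
phi = (slates == 0).sum(axis=1)

# Logging policy pi_0 and evaluation policy pi_e over slate indices.
pi_0 = rng.dirichlet(np.ones(len(slates)))
pi_e = rng.dirichlet(np.ones(len(slates)))

# Non-linear reward that depends on the slate only through phi.
mean_reward = 0.1 + 0.8 * (phi == 2)

# Logged data collected under pi_0.
logged = rng.choice(len(slates), size=5000, p=pi_0)
rewards = rng.binomial(1, mean_reward[logged]).astype(float)

w_slate = slate_weights(pi_e, pi_0)
w_latent = latent_weights(pi_e, pi_0, phi)

true_value = float(np.sum(pi_e * mean_reward))
ips_estimate = weighted_estimate(w_slate, logged, rewards)
latent_estimate = weighted_estimate(w_latent, logged, rewards)

# Second moments of the weights under pi_0; both weights have mean 1.
ips_weight_var = float(np.sum(pi_0 * (w_slate - 1.0) ** 2))
latent_weight_var = float(np.sum(pi_0 * (w_latent - 1.0) ** 2))

print(f"true value       : {true_value:.4f}")
print(f"slate-level IPS  : {ips_estimate:.4f} (weight variance {ips_weight_var:.2f})")
print(f"abstraction-level: {latent_estimate:.4f} (weight variance {latent_weight_var:.2f})")
```

In this toy setup the reward depends on the slate only through `phi`, so the abstraction-level estimate stays unbiased while its weights vary less than the slate-level ones. With an abstraction that discards reward-relevant information, bias would appear, which is presumably why the abstract emphasizes choosing the abstraction to balance bias and variance.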