Counterfactual learning to rank (CLTR) can be risky; various circumstances can cause it to produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods: it cannot handle trust bias, and its guarantees rely on specific assumptions about user behavior. Our contributions are two-fold. First, we generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust (DR) CLTR and trust bias. Second, we propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that both our novel safe doubly robust method and PRPO provide higher performance than the existing safe inverse propensity scoring approach. However, when circumstances are unexpected, the safe doubly robust approach can become unsafe and severely degrade performance. In contrast, PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment, which translates to robust safety for real-world applications.
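To make the clipping idea behind PRPO concrete, the following is a minimal Python sketch of a PPO-style clipped surrogate objective. The exposure weights, relevance values, and thresholds (`eps_low`, `eps_high`) are hypothetical placeholders, and the paper's exact formulation may differ; the sketch only illustrates the mechanism the abstract describes: once the learned policy's behavior drifts past the clipping thresholds relative to the safe policy, the objective flattens and the incentive to drift further is removed.

```python
import numpy as np

# Illustrative per-document quantities for a single query. All names and
# values are hypothetical, chosen only to demonstrate the clipping effect.
relevance = np.array([1.0, 0.6, 0.2])   # estimated document utilities
w_safe = np.array([0.9, 0.5, 0.3])      # exposure weights under the safe policy
w_new = np.array([1.8, 0.4, 0.1])       # exposure weights under the learned policy

def prpo_style_objective(w_new, w_safe, relevance, eps_low=0.5, eps_high=2.0):
    """PPO-style clipped surrogate: once the ratio of new-to-safe exposure
    leaves [eps_low, eps_high], the clipped term is flat, so the gradient
    incentive to move further away from the safe policy disappears."""
    ratio = w_new / w_safe
    clipped = np.clip(ratio, eps_low, eps_high)
    # Pessimistic minimum (as in PPO) keeps the bound one-sided for
    # non-negative utilities.
    return np.sum(np.minimum(ratio * relevance, clipped * relevance))

print(prpo_style_objective(w_new, w_safe, relevance))
```

Because the clipping depends only on the two policies and not on any model of clicks or examination, the resulting limit on performance degradation holds regardless of how users actually behave.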