Counterfactual learning to rank (CLTR) aims to learn a ranking policy from user interactions while correcting for the inherent biases in interaction data, such as position bias. Existing CLTR methods assume a single ranking policy that selects a top-K ranking from the entire candidate document set. In real-world applications, the candidate document set is on the order of millions, making a single-stage ranking policy impractical. To scale to millions of documents, real-world ranking systems are designed in a two-stage fashion, with a candidate generator followed by a ranker. The existing CLTR method for a two-stage offline ranking system considers only the top-1 ranking setting and trains only the candidate generator, with the ranker kept fixed. A CLTR method for jointly training both the ranker and the candidate generator is missing from the existing literature. In this paper, we propose a two-stage CLTR estimator that accounts for the interaction between the two stages and estimates the joint value of the two policies offline. In addition, we propose a novel joint optimization method to train the candidate generator and ranker policies. To the best of our knowledge, we are the first to propose a CLTR estimator and learning method for two-stage ranking. Experimental results on a semi-synthetic benchmark demonstrate the effectiveness of the proposed joint CLTR method over baselines.
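To make the two-stage setting concrete, the following is a minimal Python sketch, not the estimator or optimization method proposed in the paper. It assumes deterministic score-based policies, a single logged ranking, and a position-based click model with known examination probabilities. The names `two_stage_rank`, `ips_click_value`, `generator_score`, `ranker_score`, and `examination_prob` are hypothetical and chosen for illustration only. The second function is a standard position-bias inverse-propensity-scoring (IPS) estimate of the kind used in single-stage CLTR.

```python
from typing import Dict, List, Sequence


def two_stage_rank(
    doc_ids: Sequence[int],
    generator_score: Dict[int, float],
    ranker_score: Dict[int, float],
    num_candidates: int,
    k: int,
) -> List[int]:
    """Stage 1 keeps the best `num_candidates` documents by generator score;
    stage 2 reorders only those candidates and returns the top k."""
    candidates = sorted(doc_ids, key=lambda d: generator_score[d], reverse=True)
    candidates = candidates[:num_candidates]
    return sorted(candidates, key=lambda d: ranker_score[d], reverse=True)[:k]


def ips_click_value(
    new_ranking: Sequence[int],
    logged_ranking: Sequence[int],
    clicks: Sequence[int],
    examination_prob: Sequence[float],
) -> float:
    """Position-bias IPS estimate of the expected clicks of `new_ranking`.

    `clicks[r]` is the click on the document logged at rank r, and
    `examination_prob[r]` is the assumed probability that rank r is examined.
    Each click is reweighted by the inverse examination probability of the
    rank at which it was logged, then weighted by the examination probability
    of the rank the document would receive in the new ranking.
    """
    logged_rank = {doc: rank for rank, doc in enumerate(logged_ranking)}
    value = 0.0
    for new_rank, doc in enumerate(new_ranking):
        rank = logged_rank.get(doc)
        if rank is None or clicks[rank] == 0:
            continue  # documents that were never shown or never clicked add nothing
        value += examination_prob[new_rank] * clicks[rank] / examination_prob[rank]
    return value
```

In this sketch, a document that the ranker would score highly can still never appear in the final ranking if the candidate generator filters it out. That dependence between the stages is the interaction the abstract refers to when it argues that the two policies should be evaluated and trained jointly.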