Unbiased learning-to-rank (ULTR) is a well-established framework for learning from user clicks, which are often biased by the ranker that collected the data. While theoretically justified and extensively tested in simulation, ULTR techniques lack empirical validation, especially on modern search engines. The Baidu-ULTR dataset, released for the WSDM Cup 2023 and collected from Baidu's search engine, offers a rare opportunity to assess the real-world performance of prominent ULTR techniques. Despite multiple submissions during the WSDM Cup 2023 and the subsequent NTCIR ULTRE-2 task, it remains unclear whether the observed improvements stem from applying ULTR or from other learning techniques. In this work, we revisit and extend the available experiments on the Baidu-ULTR dataset. We find that standard unbiased learning-to-rank techniques robustly improve click predictions but struggle to consistently improve ranking performance, especially given the stark differences that arise from the choice of ranking loss and query-document features. Our experiments reveal that gains in click prediction do not necessarily translate to enhanced ranking performance on expert relevance annotations, implying that conclusions on this benchmark depend strongly on how success is measured.