Stat-weight: Improving the Estimator of Interleaved Methods Outcomes with Statistical Hypothesis Testing

From arXiv. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2023, Proceedings, Part III, and is available online at https://doi.org/10.1007/978-3-031-28241-6_2

Interleaving is an online evaluation approach for information retrieval systems that compares the effectiveness of ranking functions by interpreting users' implicit feedback. Previous work, such as Hofmann et al. (2011), evaluated the most promising interleaved methods of the time on uniform distributions of queries. In the real world, repeated queries are usually distributed unevenly, following a long-tailed search demand curve. The more often a query is executed, by different users or in different sessions, the higher the probability of collecting implicit feedback (interactions/clicks) on the related search results. This paper first aims to replicate the Team Draft Interleaving accuracy evaluation on uniform query distributions and then assesses how this method generalizes to long-tailed real-world scenarios. The reproducibility work raised interesting considerations on how the winning ranking function for each query should affect the overall winner for the entire evaluation. Based on what we observed, we propose that not all queries should contribute to the final decision in equal proportion. Following these insights, we designed two variations of the $\Delta_{AB}$ score winner estimator that assign each query a credit based on statistical hypothesis testing. To replicate, reproduce, and extend the original work, we developed from scratch a system that simulates a search engine and users' interactions using industry datasets. Our experiments confirm our intuition and show that our methods are promising in terms of accuracy, sensitivity, and robustness to noise.
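
To make the idea of per-query credit concrete, here is a minimal Python sketch. It is not the paper's implementation. The function names, the input layout (one `(wins_a, wins_b)` pair per distinct query), the form of the score (share of queries won by A, with ties counted as half, minus 0.5), and the choice of credit as one minus the p-value of an exact two-sided binomial test are all assumptions made for illustration.

```python
from math import comb


def binom_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial test p-value for H0: p = 0.5."""
    if n == 0:
        return 1.0  # no evidence at all
    # With p = 0.5 the distribution is symmetric, so the p-value is the
    # probability of outcomes at least as far from n/2 as the observed k.
    dist = abs(k - n / 2)
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dist)
    return tail / 2 ** n


def stat_credit(wins_a: int, wins_b: int) -> float:
    """Hypothetical credit: 1 - p-value, so weak evidence counts for little."""
    return 1.0 - binom_two_sided_p(wins_a, wins_a + wins_b)


def delta_ab(per_query, credit=lambda a, b: 1.0):
    """Assumed Delta_AB form: weighted share of queries won by A, minus 0.5.

    per_query: list of (wins_a, wins_b), one pair per distinct query.
    credit:    weight each query contributes (default: every query counts 1).
    """
    score_a = score_b = score_tie = 0.0
    for a, b in per_query:
        w = credit(a, b)
        if a > b:
            score_a += w
        elif b > a:
            score_b += w
        else:
            score_tie += w
    total = score_a + score_b + score_tie
    if total == 0:
        return 0.0  # no query carries any credit
    return (score_a + 0.5 * score_tie) / total - 0.5


# One frequent query clearly favouring A, two rare queries weakly favouring B.
per_query = [(40, 10), (0, 1), (1, 2)]
unweighted = delta_ab(per_query)             # negative: B wins more queries
weighted = delta_ab(per_query, stat_credit)  # positive: only the head query is significant
```

In this toy input, equal per-query credit lets two barely observed tail queries outvote the one well-observed head query, while the hypothesis-test credit reverses the sign of the score. This is the kind of long-tail effect the abstract describes, shown here under the stated assumptions only.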
