Researchers use recall to evaluate rankings across a variety of retrieval, recommendation, and machine learning tasks. While recall has a colloquial interpretation in set-based evaluation, the research community is far from a principled understanding of recall metrics for rankings. This lack of a principled understanding of, or motivation for, recall has led to criticism within the retrieval community over whether recall is useful as a measure at all. In this light, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define "recall orientation" as sensitivity to movement of the bottom-ranked relevant item. Second, we analyze our concept of recall orientation from the perspective of robustness with respect to possible searchers and content providers. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across 17 TREC tracks, we establish that our new evaluation method, lexirecall, is correlated with existing recall metrics and exhibits substantially higher discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
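The abstract does not spell out how the lexicographic comparison is carried out, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that two rankings are compared by the position of their bottom-ranked relevant item, that ties are broken by the next-lowest relevant item, and so on upward. The function names `relevant_positions` and `lexi_compare`, and the choice to place unretrieved relevant items at infinite depth, are assumptions made for this example.

```python
import math
from typing import Hashable, List, Sequence, Set


def relevant_positions(ranking: Sequence[Hashable],
                       relevant: Set[Hashable]) -> List[float]:
    """Return the 1-based ranks of the relevant items, deepest first.

    Assumption: a relevant item missing from the ranking is treated as
    sitting at infinite depth. Rankings are assumed to hold no duplicates.
    """
    rank_of = {item: i + 1 for i, item in enumerate(ranking)}
    ranks = [rank_of.get(item, math.inf) for item in relevant]
    return sorted(ranks, reverse=True)


def lexi_compare(ranking_a: Sequence[Hashable],
                 ranking_b: Sequence[Hashable],
                 relevant: Set[Hashable]) -> int:
    """Return 1 if A is preferred, -1 if B is preferred, 0 if tied.

    The comparison starts at the bottom-ranked relevant item and moves
    upward only when the two rankings tie at the current level.
    """
    positions_a = relevant_positions(ranking_a, relevant)
    positions_b = relevant_positions(ranking_b, relevant)
    for rank_a, rank_b in zip(positions_a, positions_b):
        if rank_a != rank_b:
            # A smaller rank means the relevant item appears higher up.
            return 1 if rank_a < rank_b else -1
    return 0


if __name__ == "__main__":
    relevant = {"d1", "d2"}
    run_a = ["d1", "x", "d2", "y"]   # last relevant item at rank 3
    run_b = ["d2", "x", "y", "d1"]   # last relevant item at rank 4
    print(lexi_compare(run_a, run_b, relevant))
```

Under these assumptions the outcome is a preference between two rankings rather than a numeric score, which matches the abstract's description of the method as preference-based. Moving only the bottom-ranked relevant item is enough to flip the preference, which is the sensitivity that the abstract calls recall orientation.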