Researchers use recall to evaluate rankings across a variety of retrieval, recommendation, and machine learning tasks. While recall has a colloquial interpretation in set-based evaluation, the research community is far from a principled understanding of recall metrics for rankings. This lack of principled understanding of, or motivation for, recall has led some in the retrieval community to question whether recall is useful as a measure at all. In this light, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define "recall-orientation" as sensitivity to movement of the bottom-ranked relevant item. Second, we analyze our concept of recall orientation from the perspective of robustness with respect to possible searchers and content providers. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across 17 TREC tracks, we establish that our new evaluation method, lexirecall, is correlated with existing recall metrics and exhibits substantially higher discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
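To make the idea of a preference-based comparison built on the bottom-ranked relevant item more concrete, the following is a minimal sketch of one possible reading of a lexicographic recall comparison. It is an illustration under stated assumptions, not the paper's implementation: the function names `relevant_positions` and `lexi_compare`, the tie-breaking order (worst-ranked relevant item first, then the next worst, and so on), and the convention that unretrieved relevant items are placed just below the longer of the two rankings are all assumptions introduced here.

```python
from typing import Hashable, List, Sequence, Set


def relevant_positions(
    ranking: Sequence[Hashable], relevant: Set[Hashable], missing_rank: int
) -> List[int]:
    """Return 1-based ranks of the relevant items, worst (largest) rank first.

    Relevant items absent from the ranking get `missing_rank` (an assumption
    of this sketch, not something stated in the abstract).
    """
    first_rank = {}
    for rank, item in enumerate(ranking, start=1):
        first_rank.setdefault(item, rank)  # keep the first occurrence only
    ranks = [first_rank.get(item, missing_rank) for item in relevant]
    return sorted(ranks, reverse=True)


def lexi_compare(
    a: Sequence[Hashable], b: Sequence[Hashable], relevant: Set[Hashable]
) -> int:
    """Hypothetical lexicographic preference between two rankings.

    Returns 1 if `a` is preferred, -1 if `b` is preferred, 0 if tied.
    The bottom-ranked relevant items are compared first; ties fall through
    to the next-lowest relevant item.
    """
    missing_rank = max(len(a), len(b)) + 1
    pos_a = relevant_positions(a, relevant, missing_rank)
    pos_b = relevant_positions(b, relevant, missing_rank)
    for rank_a, rank_b in zip(pos_a, pos_b):
        if rank_a < rank_b:  # a places this relevant item higher
            return 1
        if rank_a > rank_b:
            return -1
    return 0


if __name__ == "__main__":
    relevant = {"d1", "d2"}
    run_a = ["d1", "x", "d2", "y"]  # bottom relevant item at rank 3
    run_b = ["d1", "x", "y", "d2"]  # bottom relevant item at rank 4
    print(lexi_compare(run_a, run_b, relevant))  # prints 1
```

Because the comparison returns a preference rather than a numeric score, two rankings that place their last relevant item at the same rank can still be separated by the next-lowest relevant item, which is one intuition for why a lexicographic method might produce fewer ties than a single-number recall metric.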