Researchers use recall to evaluate rankings across a variety of retrieval, recommendation, and machine learning tasks. While recall has a colloquial interpretation in set-based evaluation, the research community is far from a principled understanding of recall metrics for rankings. This lack of a principled understanding of, or motivation for, recall has led some in the retrieval community to question whether recall is useful as a measure at all. In this light, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define 'recall-orientation' as sensitivity to movement of the bottom-ranked relevant item. Second, we analyze our concept of recall-orientation from the perspective of robustness with respect to possible searchers and content providers. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across 17 TREC tracks, we establish that our new evaluation method, lexirecall, is correlated with existing recall metrics and exhibits substantially higher discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
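The abstract does not spell out the comparison procedure, so the following is only a minimal sketch of one plausible reading of "lexicographic comparison" for recall: two rankings are compared on the position of their bottom-ranked relevant item, and ties are broken by the next relevant item up. The function names `worst_first_positions` and `lexi_compare`, the treatment of unretrieved relevant items as ranked at infinity, and the assumption that a ranking contains no duplicate items are illustrative choices, not details taken from the paper.

```python
import math
from typing import Hashable, List, Sequence, Set


def worst_first_positions(ranking: Sequence[Hashable],
                          relevant: Set[Hashable]) -> List[float]:
    """Return the 1-based ranks of the relevant items, worst (deepest) first.

    Assumption: a relevant item missing from the ranking is treated as if it
    were ranked at infinity, i.e. below every retrieved item.
    """
    position = {item: i + 1 for i, item in enumerate(ranking)}
    ranks = [position.get(item, math.inf) for item in relevant]
    return sorted(ranks, reverse=True)


def lexi_compare(ranking_a: Sequence[Hashable],
                 ranking_b: Sequence[Hashable],
                 relevant: Set[Hashable]) -> int:
    """Return 1 if ranking_a is preferred, -1 if ranking_b is, 0 if tied.

    The bottom-ranked relevant item is compared first; only when those
    positions are equal does the comparison move to the next relevant item up.
    """
    ranks_a = worst_first_positions(ranking_a, relevant)
    ranks_b = worst_first_positions(ranking_b, relevant)
    for rank_a, rank_b in zip(ranks_a, ranks_b):
        if rank_a != rank_b:
            # A smaller rank means the relevant item is reached sooner.
            return 1 if rank_a < rank_b else -1
    return 0


# Hypothetical toy data: two relevant items, three candidate rankings.
relevant = {"d1", "d2"}
a = ["d1", "x", "d2", "y"]   # relevant items at ranks 1 and 3
b = ["x", "d1", "y", "d2"]   # relevant items at ranks 2 and 4
c = ["x", "d1", "d2", "y"]   # relevant items at ranks 2 and 3

print(lexi_compare(a, b, relevant))  # a's last relevant item is higher
print(lexi_compare(a, c, relevant))  # tie at rank 3, broken by the next item up
```

All three toy rankings retrieve every relevant item, so set-based recall over the full list cannot separate them, whereas the pairwise comparison above still yields a preference. This is meant only to illustrate why a preference-based method might discriminate between more pairs of systems than a single recall score.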