The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has made evaluating the performance of these systems significantly more challenging. One of the main challenges is the scarcity of human-labeled data, which hinders fair and accurate assessment. In this work, we focus specifically on evaluating IR systems with sparse labels, borrowing from recent research on evaluating computer vision tasks. Taking inspiration from the success of the Fr\'echet Inception Distance (FID) in assessing text-to-image generation systems, we propose leveraging the Fr\'echet Distance to measure the distance between the distribution of judged relevant items and the distribution of retrieved results. Our experimental results on the MS MARCO V1 dataset and the TREC Deep Learning Track query sets demonstrate the effectiveness of the Fr\'echet Distance as a metric for evaluating IR systems, particularly in settings where only a few labels are available. This approach contributes to the advancement of evaluation methodologies in real-world scenarios such as the assessment of generative IR systems.
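As a rough illustration of the idea, the sketch below computes an FID-style Fr\'echet Distance between two sets of embedding vectors, one standing in for judged relevant items and one for retrieved results. It assumes each set is summarized by a Gaussian (mean and covariance), as in FID; the abstract does not state the exact formulation, the embedding model, or any preprocessing, so the function name `frechet_distance` and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """FID-style (squared) Frechet distance between Gaussians fitted to x and y.

    x, y: arrays of shape (n_items, dim), one embedding per row.
    Assumption: Gaussian summaries of each set, as used by FID.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))

    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite inputs (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for embeddings of judged relevant passages
    # and of the passages returned by a retrieval system.
    relevant = rng.normal(size=(200, 8))
    retrieved = rng.normal(loc=0.5, size=(200, 8))
    print(frechet_distance(relevant, retrieved))
```

A lower value means the retrieved set is distributed more like the judged relevant set. Because only distribution-level statistics are compared, retrieved items need no individual labels, which is what makes this kind of measure attractive when judgments are sparse.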