Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet the conventional definition of relevance is dominated by textual similarity, leaving open whether a text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexically and semantically similar) but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts, who label whether a text is connected to a query (relevance) or whether it holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance than with usefulness. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. Nevertheless, UsefulBench remains a challenging dataset for targeted information retrieval systems.