Conventional information retrieval is concerned with identifying the relevance of texts to a given query. Yet the conventional definition of relevance is dominated by textual similarity, leaving unexamined whether a text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity) but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset in which three professional analysts label whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. Nonetheless, UsefulBench remains a challenging dataset for targeted information retrieval systems.