In the information retrieval (IR) domain, evaluation plays a crucial role in optimizing search experiences and supporting diverse user intents. In the recent LLM era, research has been conducted to automate document relevance labeling, as these labels have traditionally been assigned by crowdsourced workers, a process that is both time-consuming and costly. This study focuses on LLM-generated usefulness labels, an evaluation criterion that captures the user's search intent and task objectives, an aspect where relevance alone falls short. Our experiment utilizes task-level, query-level, and document-level features along with user search behavior signals, which are essential in defining the usefulness of a document. Our research finds that (i) pre-trained LLMs can generate usefulness labels of moderate quality by understanding the search task session as a whole, and (ii) pre-trained LLMs judge documents in short search sessions more accurately when provided with session context. Additionally, we investigate whether LLMs can capture the divergence between relevance and usefulness, and we conduct an ablation study to identify the metrics most critical to accurate usefulness label generation. In conclusion, this work explores LLM-generated usefulness labels by evaluating these critical signals and optimizing for practicality in real-world settings.
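To make the labeling setup concrete, the following is a minimal illustrative sketch (not the paper's actual pipeline) of how task-, query-, and document-level features plus behavior signals might be assembled into a prompt that asks an LLM for a usefulness grade. All names, fields, and the 0-3 grading scale here are assumptions for illustration only.

```python
# Hypothetical sketch: assembling session context into a usefulness-grading
# prompt. Field names and the 0-3 scale are illustrative assumptions, not
# the study's actual schema.
from dataclasses import dataclass

@dataclass
class SessionRecord:
    task_description: str   # task-level: the user's overall search goal
    query: str              # query-level: the query issued in this session
    document_text: str      # document-level: content of the clicked result
    dwell_time_s: float     # behavior signal: time spent on the document
    click_rank: int         # behavior signal: rank position of the click

def build_usefulness_prompt(rec: SessionRecord) -> str:
    """Format session context into a grading prompt for an LLM judge."""
    return (
        "You are judging how USEFUL a document was for a searcher's task, "
        "not merely how topically relevant it is.\n"
        f"Search task: {rec.task_description}\n"
        f"Query: {rec.query}\n"
        f"Document: {rec.document_text[:2000]}\n"
        f"Behavior signals: dwell time = {rec.dwell_time_s:.0f}s, "
        f"clicked at rank {rec.click_rank}.\n"
        "Answer with a single usefulness grade from 0 (useless) to 3 "
        "(highly useful), considering whether the document advanced the task."
    )

if __name__ == "__main__":
    rec = SessionRecord(
        task_description="Plan a week-long trip to Kyoto on a budget",
        query="cheap ryokan kyoto",
        document_text="A guide to affordable ryokan stays in Kyoto ...",
        dwell_time_s=95.0,
        click_rank=2,
    )
    print(build_usefulness_prompt(rec))
```

The resulting prompt string would then be sent to a pre-trained LLM; separating relevance from usefulness in the instruction, and surfacing behavior signals explicitly, reflects the distinction the study evaluates.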