LLM-Driven Usefulness Judgment for Web Search Evaluation

Evaluation is fundamental to optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods rely primarily on relevance labels, which assess how well retrieved documents match a user's query. However, relevance alone fails to capture how effectively a search system helps users achieve their search goals, which makes usefulness a critical evaluation criterion. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness. We propose Task-aware Rubric-based Usefulness Evaluation (TRUE), a method that employs iterative sampling and reasoning to model complex search behavior patterns. Our findings show that (i) LLMs can generate usefulness labels of moderate quality by leveraging comprehensive search session history that incorporates personalization and contextual understanding, and (ii) fine-tuned LLMs produce better usefulness judgments when provided with structured search session contexts. Additionally, we examine whether LLMs can distinguish between relevance and usefulness, particularly in cases where the two diverge and the divergence affects search success. We also conduct an ablation study to identify the key metrics for accurate usefulness label generation, optimizing for token efficiency and cost-effectiveness in real-world applications. This study advances LLM-based usefulness evaluation by refining key user metrics, examining the reliability of LLM-generated labels, and ensuring feasibility for large-scale search systems.
