LLM-based relevance judgment generation has become a crucial approach to advancing evaluation methodologies in Information Retrieval (IR). It has progressed significantly, often showing high correlation with human judgments, as reflected in the LLMJudge leaderboards \cite{rahmani2025judging}. However, existing methods for relevance judgment rely heavily on sensitive prompting strategies and lack standardized workflows for generating reliable labels. To fill this gap, we reintroduce our method, \textit{Task-aware Rubric-based Evaluation} (TRUE), for relevance judgment generation. Originally developed for usefulness evaluation in search sessions, TRUE is extended here to relevance judgment on the strength of its demonstrated effectiveness and reproducible workflow. The framework leverages iterative data sampling and reasoning to evaluate relevance across multiple factors, including intent, coverage, specificity, accuracy, and usefulness. We evaluate TRUE on the TREC DL 2019, TREC DL 2020, and LLMJudge datasets, and our results show that it achieves strong performance on the system-ranking LLM leaderboards. The primary focus of this work is to provide a reproducible framework for LLM-based relevance judgment, and we further analyze the effectiveness of TRUE across multiple dimensions.