Many readers today struggle to assess the trustworthiness of online news because reliable reporting coexists with misinformation. The TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track provided a venue for researchers to develop and evaluate assistive RAG systems that support readers in assessing news trustworthiness by producing reader-oriented, well-attributed reports. As the organizers of the DRAGUN track, we describe the newly developed resources that make the track's tasks reusable. The track had two tasks: (Task 1) Question Generation, producing 10 ranked investigative questions; and (Task 2, the main task) Report Generation, producing a 250-word report grounded in the MS MARCO V2.1 Segmented Corpus. As part of the track's evaluation, we had TREC assessors create importance-weighted rubrics of questions with expected short answers for 30 different news articles. These rubrics represent the information that assessors believe is important for readers to assess an article's trustworthiness. The assessors then used their rubrics to manually judge the participating teams' submitted runs. To make these tasks and their rubrics reusable, we have created an automated process to judge runs that were not part of the original assessment. We show that our AutoJudge ranks existing runs consistently with the TREC human-assessed evaluation (Kendall's $\tau = 0.678$ for Task 1 and $\tau = 0.872$ for Task 2). These resources enable both the evaluation of RAG systems for assistive news trustworthiness assessment and, with the human evaluation as a benchmark, research on improving automated RAG evaluation.
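For reference, Kendall's $\tau$ measures the agreement between two rankings of the same $n$ runs (here, the AutoJudge ranking versus the human-assessed TREC ranking). With $C$ concordant and $D$ discordant run pairs, and ignoring ties,
\[
\tau = \frac{C - D}{\binom{n}{2}}, \qquad -1 \le \tau \le 1,
\]
so $\tau = 1$ indicates identical orderings and $\tau = -1$ a fully reversed ordering.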
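To make the rubric-based scoring concrete, the sketch below shows one plausible way to turn an importance-weighted rubric into a run score: weighted recall over the rubric items that a judged report covers. The `RubricItem` fields, the coverage input, and the aggregation are illustrative assumptions for exposition, not the track's official metric.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    question: str         # investigative question the assessor deemed important
    expected_answer: str  # expected short answer
    weight: float         # assessor-assigned importance weight (assumed field)

def rubric_score(items: list[RubricItem], covered: set[int]) -> float:
    """Weighted recall: fraction of total rubric weight that a report covers.

    `covered` holds indices of rubric items whose expected answers the
    (human or automatic) judge found supported in the submitted report.
    """
    total = sum(it.weight for it in items)
    if total == 0.0:
        return 0.0
    hit = sum(it.weight for i, it in enumerate(items) if i in covered)
    return hit / total

# Toy example: a 3-item rubric where the report answers items 0 and 2.
rubric = [
    RubricItem("Who originally published the claim?", "a press release", 2.0),
    RubricItem("Is the cited study peer-reviewed?", "no, a preprint", 3.0),
    RubricItem("Does the headline match the article body?", "yes", 1.0),
]
print(rubric_score(rubric, covered={0, 2}))  # -> 0.5
```

Under this scheme, a report that covers the highest-weight rubric items scores higher than one covering the same number of low-weight items, which matches the intent of importance weighting described above.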