Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for evaluating report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present an analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.