Generation of long-form, citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation. We present an analysis of Auto-ARGUE on the RG pilot task from the TREC 2024 NeuCLIR track, showing good system-level correlations with human judgments. We further release a web app for visualizing Auto-ARGUE outputs.