Computational argumentation, which involves generating answers or summaries for controversial topics such as abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), which leverages real-world evidence to produce high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, reusing existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, which provide more accurate and interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, enabling exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM judges on a prior dataset and on the new ConQRet benchmark. Our proposed LLM judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.