Computational argumentation, which involves generating answers or summaries for controversial topics such as abortion bans and vaccination, has become increasingly important in today's polarized environment. Sophisticated LLM capabilities offer the potential to provide nuanced, evidence-based answers to such questions through Retrieval-Augmented Argumentation (RAArg), which leverages real-world evidence to produce high-quality, grounded arguments. However, evaluating RAArg remains challenging, as human evaluation is costly and difficult for complex, lengthy answers on complicated topics. At the same time, reusing existing argumentation datasets is no longer sufficient, as they lack long, complex arguments and realistic evidence from potentially misleading sources, limiting holistic evaluation of retrieval effectiveness and argument quality. To address these gaps, we investigate automated evaluation methods using multiple fine-grained LLM judges, which provide more accurate and interpretable assessments than traditional single-score metrics and even previously reported human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, a new benchmark featuring long and complex human-authored arguments on debated topics, grounded in real-world websites, enabling exhaustive evaluation across retrieval effectiveness, argument quality, and groundedness. We validate our LLM judges on a prior dataset and on the new ConQRet benchmark. Our proposed LLM judges and the ConQRet benchmark can enable rapid progress in computational argumentation and can be naturally extended to other complex retrieval-augmented generation tasks.