Computational argumentation has become an essential tool in fields such as artificial intelligence, law, and public policy, and it is attracting increasing attention as a research area in natural language processing (NLP). Research on computational argumentation mainly involves two types of tasks: argument mining and argument generation. Because large language models (LLMs) have demonstrated strong abilities in understanding context and generating natural language, it is worthwhile to evaluate their performance on a range of computational argumentation tasks. This work assesses LLMs such as ChatGPT, Flan models, and LLaMA2 models on computational argumentation tasks under zero-shot and few-shot settings. We organize existing tasks into 6 main classes and standardize the format of 14 open-sourced datasets. In addition, we present a new benchmark dataset on counter speech generation, which aims to holistically evaluate the end-to-end performance of LLMs on argument mining and argument generation. Extensive experiments show that LLMs exhibit commendable performance across most of these datasets, demonstrating their capabilities in the field of argumentation. We also highlight the limitations in evaluating computational argumentation and provide suggestions for future research directions in this field.