Computational argumentation has become an essential tool in domains such as law, public policy, and artificial intelligence, and it is an emerging research field in natural language processing that is attracting increasing attention. Research on computational argumentation mainly involves two types of tasks: argument mining and argument generation. Because large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language, it is worthwhile to evaluate their performance on diverse computational argumentation tasks. This work assesses LLMs, such as ChatGPT, Flan models, and LLaMA2 models, in both zero-shot and few-shot settings. We organize existing tasks into six main categories and standardize the format of fourteen openly available datasets. In addition, we present a new benchmark dataset on counter speech generation that aims to holistically evaluate the end-to-end performance of LLMs on argument mining and argument generation. Extensive experiments show that LLMs exhibit commendable performance across most of the datasets, demonstrating their capabilities in the field of argumentation. Our analysis offers valuable suggestions for evaluating computational argumentation and its integration with LLMs in future research.