Computational argumentation has become an essential tool in fields such as artificial intelligence, law, and public policy, and it is an emerging research area in natural language processing that is attracting increasing attention. Research on computational argumentation mainly involves two types of tasks: argument mining and argument generation. Because large language models (LLMs) have demonstrated strong abilities in understanding context and generating natural language, it is worthwhile to evaluate their performance on a range of computational argumentation tasks. This work assesses LLMs such as ChatGPT, Flan models, and LLaMA2 models on computational argumentation under zero-shot and few-shot settings. We organize existing tasks into six main categories and standardize the format of fourteen open-source datasets. In addition, we present a new benchmark dataset on counter speech generation, which aims to holistically evaluate the end-to-end performance of LLMs on argument mining and argument generation. Extensive experiments show that LLMs perform well on most of these datasets, demonstrating their capabilities in the field of argumentation. Our analysis offers suggestions for evaluating computational argumentation and for integrating it with LLMs in future research.