We evaluate the ability of contemporary large language models (LLMs) to perform argumentative reasoning. We frame our experiments in terms of the argument mining (AM) and argument pair extraction (APE) tasks, and evaluate the models' ability to reason at increasing levels of abstraction in the input and output representations (e.g., arbitrary label sets, semantic graphs). We find that, although LLMs are able to match or surpass the state of the art in AM and APE, their argumentative reasoning performance depends strongly on the input and output representation. We also find an "exemplar effect", in which adding exemplars beyond an optimum of about 4-5 becomes increasingly detrimental to task performance. Neither result extends to chain-of-thought (CoT) prompting: we find the exemplar effect to be nullified, and our results suggest that CoT allows for better performance under ill-conditioned problems. We hope that the work reported contributes to the improvement of argumentative reasoning in LLMs.