Large Language Models (LLMs) are increasingly evaluated and used in Argument Mining (AM) because of their strong general reasoning capabilities. However, standard training-free models often miss subtle details, particularly when two parts of a text must be analyzed jointly. Furthermore, self-correction mechanisms tend to reinforce initial hallucinations in reasoning. Overcoming these limitations typically requires expensive, domain-specific supervised fine-tuning. Recent work has shown that a multi-agent paradigm can address such weaknesses in the component classification task through dialectical refinement with a Proponent-Opponent-Judge architecture, setting a promising direction for training-free approaches in the field. In this paper, we extend this framework to the Argument Relation Identification and Classification (ARIC) task and evaluate it there, reformulating the task as a debate over component pairs. We also introduce a confidence gating mechanism that triggers a debate only for uncertain cases and accepts the initial prediction when confidence is high. On the UKP Argument Annotated Essays v2 corpus, we show that selective debate achieves the highest Macro F1 among all training-free methods, whereas debating every sample degrades performance below that of one of the baselines. All generative approaches also outperform fine-tuned RoBERTa models on Macro F1, suggesting that the under-representation of the Attack class harms supervised fine-tuning more than it harms inference-only models. Additionally, our framework produces human-readable debate transcripts, offering interpretability that both single-agent and supervised classifiers lack.
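The confidence gating idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `gated_predict`, the `Prediction` container, the 0.8 threshold, the label names, and both stub agents are hypothetical and chosen only for illustration. The abstract does not state how confidence is computed or which threshold is used.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Pair = Tuple[str, str]  # (source component, target component)


@dataclass
class Prediction:
    label: str
    confidence: float
    debated: bool = False
    transcript: str = ""


def gated_predict(
    pair: Pair,
    initial: Callable[[Pair], Tuple[str, float]],
    debate: Callable[[Pair, str], Tuple[str, str]],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Prediction:
    """Accept a confident initial prediction; otherwise run a debate."""
    label, confidence = initial(pair)
    if confidence >= threshold:
        # High confidence: keep the single-agent answer, skip the debate.
        return Prediction(label, confidence)
    # Low confidence: hand the pair to the debate and keep its transcript.
    final_label, transcript = debate(pair, label)
    return Prediction(final_label, confidence, debated=True, transcript=transcript)


# Hypothetical stand-ins for the LLM calls, so the sketch runs offline.
def stub_initial(pair: Pair) -> Tuple[str, float]:
    if "because" in pair[0]:
        return "Support", 0.95
    return "None", 0.40


def stub_debate(pair: Pair, initial_label: str) -> Tuple[str, str]:
    transcript = (
        f"Proponent: the initial label '{initial_label}' holds.\n"
        "Opponent: the source contradicts the target.\n"
        "Judge: Attack"
    )
    return "Attack", transcript


confident = gated_predict(
    ("Cars pollute because they burn fuel.", "Cars harm the air."),
    stub_initial,
    stub_debate,
)
uncertain = gated_predict(
    ("Trains are slow.", "Trains are the best option."),
    stub_initial,
    stub_debate,
)
```

In this sketch the first pair is accepted without debate, while the second falls below the threshold and is routed to the debate, whose transcript is retained for inspection. Raising the threshold sends more pairs to debate; lowering it accepts more initial predictions.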