As NLP models become more complex, understanding their decisions becomes more crucial. Counterfactuals (CFs), where minimal changes to inputs flip a model's prediction, offer a way to explain these models. While Large Language Models (LLMs) have shown remarkable performance in NLP tasks, their efficacy in generating high-quality CFs remains uncertain. This work fills this gap by investigating how well LLMs generate CFs for two NLU tasks. We conduct a comprehensive comparison of several common LLMs and evaluate their CFs, assessing both intrinsic metrics and the impact of these CFs on data augmentation. Moreover, we analyze differences between human- and LLM-generated CFs, providing insights for future research directions. Our results show that LLMs generate fluent CFs but struggle to keep the induced changes minimal. Generating CFs for Sentiment Analysis (SA) is less challenging than for NLI, where LLMs show weaknesses in generating CFs that flip the original label. This is also reflected in data augmentation performance, where we observe a large gap between augmenting with human-generated and LLM-generated CFs. Furthermore, we evaluate LLMs' ability to assess CFs in a mislabelled data setting and show that they have a strong bias towards agreeing with the provided labels. GPT-4 is more robust against this bias, and its scores correlate well with automatic metrics. Our findings reveal several limitations and point to potential future work directions.