Counterfactual examples are minimal edits to an input that alter a model's prediction. They are widely used in explainable AI to probe model behavior and in natural language processing (NLP) to augment training data. However, generating valid counterfactuals with large language models (LLMs) remains challenging: existing single-pass methods often fail to induce reliable label changes and overlook LLMs' self-correction capabilities. To tap this potential, we propose iFlip, an iterative refinement approach that leverages three types of feedback: model confidence, feature attribution, and natural language feedback. Our results show that iFlip achieves, on average, 57.8% higher validity than five state-of-the-art baselines, as measured by the label-flipping rate. A user study further confirms that iFlip outperforms the baselines in completeness, overall satisfaction, and feasibility. In addition, ablation studies show that three components are essential for iFlip to generate valid counterfactuals: using an appropriate number of iterations, pointing the model to highly attributed words, and early stopping. Finally, counterfactuals generated by iFlip enable effective counterfactual data augmentation, substantially improving model performance and robustness.
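To make the loop described above concrete, here is a minimal Python sketch of iterative counterfactual refinement: query the classifier, stop early once the label flips, and otherwise feed confidence, attribution, and natural-language feedback back into the editor. This is an illustration under stated assumptions, not the paper's implementation; all names (`classifier`, `attribute`, `generate_edit`) are hypothetical placeholders.

```python
# Hypothetical sketch of an iFlip-style refinement loop. The three feedback
# signals (model confidence, feature attribution, natural language) and the
# early-stopping check mirror the components named in the abstract.

def iterative_counterfactual(text, target_label, classifier, attribute,
                             generate_edit, max_iters=5):
    """Refine a candidate counterfactual until the classifier's label flips
    or the iteration budget is exhausted."""
    candidate = text
    for _ in range(max_iters):
        # Model-confidence feedback: hypothetical classifier returning
        # (predicted_label, confidence_score).
        label, confidence = classifier(candidate)
        if label == target_label:
            return candidate  # early stopping: the label has flipped
        # Feature-attribution feedback: hypothetical attribution method
        # returning the words most responsible for the current prediction.
        salient_words = attribute(candidate, label)
        # Natural-language feedback assembled into the next editing prompt.
        feedback = (
            f"The prediction is still '{label}' (confidence {confidence:.2f}). "
            f"Focus minimal edits on these influential words: {salient_words}."
        )
        candidate = generate_edit(candidate, target_label, feedback)
    return None  # no valid counterfactual found within the budget
```

Returning `None` once the budget is exhausted reflects the ablation finding that capping the number of iterations, rather than refining indefinitely, is important for producing valid counterfactuals.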