Feature attribution methods explain model predictions by highlighting the important input tokens, and they have been widely applied to deep neural networks in pursuit of trustworthy AI. However, recent work shows that the explanations these methods produce struggle to be faithful and robust. In this paper, we propose REGEX, a method that combines Robustness improvement and Explanation Guided training towards more faithful EXplanations for text classification. First, we improve model robustness through input gradient regularization and virtual adversarial training. Second, we use salient ranking to mask noisy tokens and maximize the similarity between model attention and feature attribution, which can be seen as a self-training procedure that requires no external information. We conduct extensive experiments on six datasets with five attribution methods, and also evaluate faithfulness in the out-of-domain setting. The results show that REGEX improves the fidelity metrics of explanations in all settings and achieves consistent gains on two randomization tests. Moreover, we show that training select-then-predict models on the highlight explanations produced by REGEX yields task performance comparable to the end-to-end method.
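To make the explanation-guided step more concrete, the following is a minimal sketch of its two ingredients: masking low-ranked tokens by attribution score, and penalizing the mismatch between attention and attribution. It is an illustration under stated assumptions, not the implementation used in this work. The function names, the `keep_ratio` parameter, the `[MASK]` placeholder, and the choice of KL divergence as the similarity measure are all hypothetical; the abstract does not specify them. The robustness step (input gradient regularization and virtual adversarial training) is not shown.

```python
import math


def normalize(scores):
    """Scale non-negative scores so they sum to 1 (uniform if all are zero)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def mask_low_saliency(tokens, attributions, keep_ratio=0.5, mask_token="[MASK]"):
    """Keep the top-ranked tokens by absolute attribution and mask the rest.

    keep_ratio and mask_token are assumptions made for this sketch.
    """
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: abs(attributions[i]),
                    reverse=True)
    keep = set(ranked[:n_keep])
    return [tok if i in keep else mask_token for i, tok in enumerate(tokens)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_attribution_loss(attention, attributions):
    """Mismatch between attention weights and normalized attribution scores.

    Minimizing this value corresponds to maximizing their similarity.
    KL divergence is an assumed stand-in for the similarity measure.
    """
    target = normalize([abs(a) for a in attributions])
    predicted = normalize(attention)
    return kl_divergence(target, predicted)


# Toy example with made-up attribution scores.
tokens = ["the", "movie", "was", "truly", "great"]
attributions = [0.05, 0.30, 0.02, 0.20, 0.90]

masked = mask_low_saliency(tokens, attributions, keep_ratio=0.5)
uniform_attention = [0.2] * 5
loss_uniform = attention_attribution_loss(uniform_attention, attributions)
loss_aligned = attention_attribution_loss(attributions, attributions)
```

In this toy example, the three highest-ranked tokens ("great", "movie", "truly") survive masking, and attention that already matches the attribution scores incurs zero loss, whereas uniform attention incurs a positive loss. In a real model the loss would be computed on tensors and added to the classification objective with a weighting coefficient, which is likewise an assumption here.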