A popular approach to unveiling the black box of neural NLP models is to leverage saliency methods, which assign a scalar importance score to each input component. A common practice for evaluating whether an interpretability method is faithful has been evaluation-by-agreement: if multiple methods agree on an explanation, its credibility increases. However, recent work has found that saliency methods exhibit weak rank correlations even when applied to the same model instance, and has advocated for the use of alternative diagnostic methods. In our work, we demonstrate that rank correlation is not a good fit for evaluating agreement and argue that Pearson-$r$ is a better-suited alternative. We further show that regularization techniques that increase the faithfulness of attention explanations also increase agreement between saliency methods. By connecting our findings to instance categories based on training dynamics, we show that the agreement between saliency method explanations is very low for easy-to-learn instances. Finally, we connect the improvement in agreement across instance categories to local representation space statistics of instances, paving the way for work on analyzing which intrinsic model properties make models more amenable to interpretability methods.
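The contrast between rank correlation and Pearson-$r$ can be illustrated with a small sketch. This is not the authors' experimental setup: the two score vectors below are made-up toy values standing in for the outputs of two hypothetical saliency methods on the same eight-token input, and the helper names (`pearson_r`, `spearman_rho`, `ranks`) are ours. Both methods agree that the first two tokens matter and the rest barely do, but they order the near-zero scores differently.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ascending ranks starting at 1. Assumes no tied values (true for the toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation, computed as Pearson-r on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Toy saliency scores (assumed values, not taken from the paper).
method_a = [0.90, 0.80, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
method_b = [0.85, 0.95, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]

r = pearson_r(method_a, method_b)
rho = spearman_rho(method_a, method_b)
print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

On this toy input, Pearson-$r$ is high because both methods single out the same two tokens, while the rank correlation is low because it weights the reshuffled, near-zero scores as heavily as the salient ones. This is one plausible reading of why rank correlation may understate agreement between saliency methods.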