Although counterfactual explanations are a popular approach to explaining black-box ML classifiers, they are less widespread in NLP. Most methods find such explanations by iteratively perturbing the target document until the black box classifies it differently. We identify two main families of counterfactual explanation methods in the literature: (a) \emph{transparent} methods, which perturb the target by adding, removing, or replacing words, and (b) \emph{opaque} approaches, which project the target document into a latent, non-interpretable space and carry out the perturbation there. This article offers a comparative study of the performance of these two families on three classical NLP tasks. Our empirical evidence shows that opaque approaches can be overkill for downstream applications such as fake news detection or sentiment analysis, since they add a level of complexity with no significant performance gain. These observations motivate our discussion, which raises the question of whether it makes sense to explain a black box using another black box.
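To make the transparent family concrete, the following is a minimal sketch of a word-removal counterfactual search. It is not one of the methods evaluated in this article: the function `removal_counterfactual`, the parameter `max_removed`, and the lexicon-based `toy_classifier` are hypothetical names introduced purely for illustration, and the exhaustive search over small sets of removed words is just one simple way to perturb a document until its predicted label changes.

```python
from itertools import combinations
from typing import Callable, List, Optional


def toy_classifier(tokens: List[str]) -> str:
    """Hypothetical stand-in for a black box: a tiny lexicon-based sentiment rule."""
    positive = {"great", "good", "excellent"}
    negative = {"bad", "boring", "awful"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative"


def removal_counterfactual(
    tokens: List[str],
    classify: Callable[[List[str]], str],
    max_removed: int = 3,
) -> Optional[List[str]]:
    """Search for a counterfactual by deleting as few words as possible.

    Candidates are tried in order of increasing number of removed words,
    so the first label flip found uses a minimal number of deletions
    (up to max_removed). Returns None if no flip is found.
    """
    original_label = classify(tokens)
    for k in range(1, max_removed + 1):
        for indices in combinations(range(len(tokens)), k):
            dropped = set(indices)
            candidate = [t for i, t in enumerate(tokens) if i not in dropped]
            # Only the black box's output label is queried, never its internals.
            if classify(candidate) != original_label:
                return candidate
    return None


if __name__ == "__main__":
    document = "a great film with good acting but a boring ending".split()
    counterfactual = removal_counterfactual(document, toy_classifier)
    print("original:      ", " ".join(document), "->", toy_classifier(document))
    if counterfactual is not None:
        print("counterfactual:", " ".join(counterfactual), "->", toy_classifier(counterfactual))
```

Because every perturbation is an explicit word deletion, the difference between the document and its counterfactual can be read off directly, which is the property that distinguishes transparent methods from those that perturb in a latent space. The search cost grows combinatorially with `max_removed`, so practical methods would need a smarter search strategy.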