One well-motivated explanation method for classifiers leverages counterfactuals: hypothetical events that are identical to real observations in all respects except for the value of one categorical feature. Constructing such counterfactuals poses specific challenges for text, however, because some attribute values may not correspond to plausible real-world events. In this paper we propose a simple method for generating counterfactuals by intervening in the space of text representations, which bypasses this limitation. We argue that our interventions are minimally disruptive and theoretically sound, as they align with counterfactuals as defined in Pearl's causal inference framework. To validate our method, we first conduct experiments on a synthetic dataset of counterfactuals, which allows a direct comparison between classifier predictions on ground-truth counterfactuals (obtained through explicit text interventions) and predictions on our counterfactuals (derived through interventions in the representation space). Second, we study a real-world scenario in which our counterfactuals can be leveraged both for explaining a classifier and for bias mitigation.