Interventions targeting the representation space of language models (LMs) have emerged as an effective means of influencing model behavior. Such methods are used, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, thereby creating a counterfactual representation. However, because the intervention operates within the representation space, it is difficult to determine precisely which features it modifies. We show that representation-space counterfactuals can be converted into natural language counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features used to encode a specific concept. Moreover, the resulting counterfactuals can be used to mitigate bias in classification.