Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
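To make the closed-form claim concrete, the following is a minimal NumPy sketch of a least-squares eraser. It assumes the whitening-based form r(x) = x - W⁺ P W (x - E[x]), where W is the pseudoinverse of the square root of the covariance of X and P is the orthogonal projection onto the column space of W Σ_XZ. The names `fit_leace` and `erase` and the `tol` parameter are hypothetical and chosen for illustration; they are not the API of the concept-erasure repository.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-8):
    """Fit an affine eraser from features X (n, d) and concept labels Z (n,) or (n, k).

    Returns (mu_x, P) so that the erased features are X - (X - mu_x) @ P.T.
    Hypothetical helper; a sketch under the assumptions stated above.
    """
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = X.shape[0]

    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # covariance of X, shape (d, d)
    sigma_xz = Xc.T @ Zc / n  # cross-covariance of X and Z, shape (d, k)

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # both built from the eigendecomposition of sigma_xx.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V = evecs[:, keep]
    root = np.sqrt(evals[keep])
    W = (V / root) @ V.T
    W_pinv = (V * root) @ V.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    proj = U @ U.T

    return mu_x, W_pinv @ proj @ W


def erase(X, mu_x, P):
    """Apply the fitted eraser row-wise to X."""
    X = np.asarray(X, dtype=np.float64)
    return X - (X - mu_x) @ P.T
```

After fitting, the empirical cross-covariance between `erase(X, mu_x, P)` and the labels should vanish up to floating-point error on the data used for fitting, which is the condition under which no linear classifier can outperform a constant predictor on that data. Erasing in the whitened space rather than the raw space is the design choice that keeps the edit small in the least-squares sense.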