Language models often exhibit undesirable behaviors, such as gender bias or toxic language. Interventions in the representation space have been shown to be effective in mitigating such issues by altering the LM's behavior. We first show that two prominent intervention techniques, Linear Erasure and Steering Vectors, offer only a limited degree of control and are limited in expressivity. We then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., "toxic") resemble those of a target class (e.g., "non-toxic"). This approach, which generalizes previous linear intervention techniques, relies on a closed-form solution to the Earth Mover's problem under Gaussian assumptions and provides theoretical guarantees on the geometric organization of the representation space. Building on this technique, we further derive a nonlinear intervention that enables controlled generation. We demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
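The closed-form Earth Mover's solution under Gaussian assumptions mentioned above refers to the well-known affine optimal transport map between two Gaussians under the squared-Euclidean (W2) cost. The sketch below illustrates that map in NumPy/SciPy; the function name and setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t):
    """Closed-form W2-optimal transport map between N(mu_s, cov_s)
    and N(mu_t, cov_t):

        T(x) = A (x - mu_s) + mu_t,
        A = cov_s^{-1/2} (cov_s^{1/2} cov_t cov_s^{1/2})^{1/2} cov_s^{-1/2}.

    Returns (A, b) so that T(x) = A @ x + b. Pushing N(mu_s, cov_s)
    through T yields exactly N(mu_t, A cov_s A^T) = N(mu_t, cov_t).
    """
    cov_s_half = np.real(sqrtm(cov_s))          # cov_s^{1/2}
    cov_s_half_inv = np.linalg.inv(cov_s_half)  # cov_s^{-1/2}
    middle = np.real(sqrtm(cov_s_half @ cov_t @ cov_s_half))
    A = cov_s_half_inv @ middle @ cov_s_half_inv
    b = mu_t - A @ mu_s
    return A, b
```

In an intervention setting, `(mu_s, cov_s)` would be estimated from source-class representations (e.g., "toxic") and `(mu_t, cov_t)` from the target class (e.g., "non-toxic"); applying `T` then moves each source representation to its counterfactual counterpart.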