Language models often exhibit undesirable behaviors, such as gender bias or toxic language. Interventions in the representation space have been shown to be effective in mitigating such issues by altering LM behavior. We first show that two prominent intervention techniques, Linear Erasure and Steering Vectors, do not enable a high degree of control and are limited in expressivity. We then propose a novel intervention methodology for generating expressive counterfactuals in the representation space, aiming to make representations of a source class (e.g., ``toxic'') resemble those of a target class (e.g., ``non-toxic''). This approach, which generalizes previous linear intervention techniques, utilizes a closed-form solution to the Earth Mover's problem under Gaussian assumptions and provides theoretical guarantees on the geometric organization of the representation space. We further build on this technique to derive a nonlinear intervention that enables controlled generation. We demonstrate the effectiveness of the proposed approaches in mitigating bias in multiclass classification and in reducing the generation of toxic language, outperforming strong baselines.
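The closed-form solution mentioned above is a standard result in optimal transport: between two Gaussians, the Wasserstein-2 (Earth Mover's) optimal map is an affine transformation. The sketch below illustrates that map in NumPy; it is a generic illustration of the Gaussian OT formula, not the paper's specific intervention, and the function name and example distributions are invented for this sketch.

```python
import numpy as np

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t):
    """Closed-form W2-optimal (Earth Mover's) map between two Gaussians.

    Transports x ~ N(mu_s, cov_s) to T(x) ~ N(mu_t, cov_t) via the affine map
        T(x) = mu_t + A (x - mu_s),
    where A = cov_s^{-1/2} (cov_s^{1/2} cov_t cov_s^{1/2})^{1/2} cov_s^{-1/2}.
    """
    # Symmetric square root and inverse square root of the source covariance.
    vals, vecs = np.linalg.eigh(cov_s)
    s_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    s_inv_half = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    # Square root of the "middle" matrix cov_s^{1/2} cov_t cov_s^{1/2}.
    m = s_half @ cov_t @ s_half
    mvals, mvecs = np.linalg.eigh(m)
    m_half = mvecs @ np.diag(np.sqrt(np.clip(mvals, 0.0, None))) @ mvecs.T
    A = s_inv_half @ m_half @ s_inv_half  # symmetric positive definite
    # Works on a single vector or a batch of row vectors (n, d).
    return lambda x: mu_t + (x - mu_s) @ A.T
```

A quick sanity check of the formula: since A cov_s A^T = cov_t by construction, pushing source samples through the map matches the target mean and covariance, which is the sense in which a "source" class distribution is made to resemble the "target" class.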