Language models often exhibit undesirable behavior, such as generating toxic or gender-biased text. In neural language models, an encoding of this behavior is often present in the model's representations. A natural and common way to prevent such behavior is therefore to steer the representations so that the model becomes less likely to generate undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of a neural language model's representations that alter its behavior. First, we derive two affine steering functions, each optimal in the least-squares sense under a different set of constraints. Our theory justifies existing approaches and yields a novel, improved steering approach. Second, we present a series of experiments demonstrating that these methods are empirically effective at mitigating bias and reducing toxic generation.
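To make the notion of an affine steering function concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not state which constraints are used, so the sketch assumes one simple possibility: the steered representations must have the same mean as representations associated with desirable text. Under that assumption, the affine map g(h) = Wh + b that changes the representations least in the squared-error sense is a pure translation, with W equal to the identity and b equal to the difference of the two means. The function names and the toy vectors are hypothetical.

```python
from statistics import fmean


def column_means(rows):
    """Mean of each coordinate over a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def fit_mean_shift(source, target):
    """Offset b of the affine map g(h) = h + b (W is the identity).

    Assumption for this sketch: the only constraint is that the mean of the
    steered source representations equals the mean of the target ones.
    """
    mu_source = column_means(source)
    mu_target = column_means(target)
    return [t - s for s, t in zip(mu_source, mu_target)]


def steer(h, offset):
    """Apply the translation to a single representation vector."""
    return [x + d for x, d in zip(h, offset)]


# Toy 2-dimensional "representations" (made-up numbers, for illustration only).
undesired = [[1.0, 2.0], [3.0, 4.0]]
desired = [[0.0, 0.0], [2.0, 6.0]]

offset = fit_mean_shift(undesired, desired)
steered = [steer(h, offset) for h in undesired]
```

After steering, the mean of `steered` coincides with the mean of `desired`, while each vector moves by the same fixed offset. A map that also matched second-order statistics would need a non-identity W, which this sketch does not attempt.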