Large language models rely on real-valued representations of text to make their predictions. These representations contain information learned from the data the model was trained on, including knowledge of linguistic properties and forms of demographic bias, e.g., gender bias. A growing body of work has considered removing information about such concepts using orthogonal projections onto subspaces of the representation space. We contribute to this body of work by proposing a formal definition of $\textit{intrinsic}$ information in a subspace of a language model's representation space. We propose a counterfactual approach that avoids the failure mode of spurious correlations (Kumar et al., 2022) by treating the components in the subspace and in its orthogonal complement independently. We show that our counterfactual notion of information in a subspace is optimized by a $\textit{causal}$ concept subspace. Furthermore, this intervention allows us to attempt concept-controlled generation by manipulating the value of the conceptual component of a representation. Empirically, we find that R-LACE (Ravfogel et al., 2022) returns a one-dimensional subspace containing roughly half of the total concept information under our framework. Our causal controlled intervention shows that, for at least one model, the subspace returned by R-LACE can be used to manipulate the concept value of the generated word with precision.
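The decomposition the abstract relies on, splitting a representation into a component inside a concept subspace and a component in its orthogonal complement, can be illustrated with a minimal NumPy sketch. It assumes a one-dimensional subspace spanned by a direction `v`; here `v` and the representation `h` are random stand-ins, not output from R-LACE or from a language model. The function names `split`, `erase`, and `set_concept` are hypothetical and chosen only for illustration, and the sketch shows the generic linear algebra rather than the paper's exact procedure.

```python
import numpy as np


def split(h, v):
    """Split h into its component along v and the orthogonal remainder."""
    unit = v / np.linalg.norm(v)          # normalize the concept direction
    parallel = np.dot(h, unit) * unit     # projection onto span(v)
    return parallel, h - parallel         # (in-subspace, complement)


def erase(h, v):
    """Remove the concept component by projecting onto the complement."""
    return split(h, v)[1]


def set_concept(h, v, alpha):
    """Keep the complement of h and set the coefficient along v to alpha."""
    unit = v / np.linalg.norm(v)
    return erase(h, v) + alpha * unit


# Random stand-ins (assumptions): an 8-dimensional representation and direction.
rng = np.random.default_rng(0)
h = rng.normal(size=8)
v = rng.normal(size=8)

par, perp = split(h, v)
edited = set_concept(h, v, 2.0)
```

Because the two components are handled separately, `set_concept` changes only the coefficient along `v` and leaves the orthogonal complement of `h` untouched, which is the property that treating the two parts independently depends on.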