Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
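To make the role of the inner product concrete, the following is a minimal sketch, not the paper's implementation. It estimates a concept direction as the mean difference over counterfactual pairs and compares cosine similarity under the Euclidean inner product with cosine similarity under a non-Euclidean one. The abstract does not define the causal inner product, so the choice of `M` as the inverse covariance of the word vectors is an assumption made purely for illustration. The function names (`inverse_covariance`, `cosine`, `concept_direction`) are hypothetical, and the data is synthetic.

```python
import numpy as np

def inverse_covariance(vectors):
    """Return M = Cov(vectors)^-1, an ASSUMED choice of non-Euclidean inner product."""
    cov = np.cov(vectors, rowvar=False)  # rows are samples, columns are dimensions
    return np.linalg.inv(cov)

def cosine(u, v, M=None):
    """Cosine similarity under <u, v>_M = u^T M v; Euclidean when M is None."""
    if M is None:
        M = np.eye(len(u))
    return float((u @ M @ v) / np.sqrt((u @ M @ u) * (v @ M @ v)))

def concept_direction(pairs):
    """Average the differences (b - a) over counterfactual pairs (a, b)."""
    return np.mean([b - a for a, b in pairs], axis=0)

# Synthetic stand-in for word (unembedding) vectors: correlated 4-d Gaussians.
rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 4))
unembeddings = rng.normal(size=(500, 4)) @ mixing.T
M = inverse_covariance(unembeddings)

# Hypothetical counterfactual pairs that differ only by a fixed offset.
offset = np.array([1.0, 0.0, -1.0, 0.5])
pairs = [(a, a + offset) for a in unembeddings[:10]]
direction = concept_direction(pairs)

# The same two vectors can have different cosine similarity under each inner product.
other = np.array([0.0, 1.0, 0.0, 0.0])
print("Euclidean cosine:", cosine(direction, other))
print("Cosine under M:  ", cosine(direction, other, M))
```

In this toy setup `direction` and `other` are orthogonal in the Euclidean sense by construction, while their similarity under `M` depends on the correlations in the synthetic data. This is the sense in which geometric statements about representations depend on the chosen inner product.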