Informally, the "linear representation hypothesis" is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space and one in the input (sentence) space. We then prove that these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
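To make the geometric vocabulary above concrete, here is a minimal sketch of how cosine similarity and projection change once the Euclidean inner product is replaced by one of the form x^T M y. Everything in it is an illustrative assumption rather than the paper's stated procedure: the vectors are random toy data (not LLaMA-2 weights), M is taken to be the inverse covariance of the toy word vectors, the concept direction is simply the mean difference over hypothetical counterfactual pairs, and the helper names `inner`, `cosine`, `concept_direction`, and `project` are made up for this example.

```python
import numpy as np

def inner(x, y, M):
    """Inner product <x, y>_M = x^T M y for a symmetric positive-definite M."""
    return float(x @ M @ y)

def cosine(x, y, M):
    """Cosine similarity measured with the inner product induced by M."""
    return inner(x, y, M) / np.sqrt(inner(x, x, M) * inner(y, y, M))

def concept_direction(pairs):
    """Mean difference over (base, counterfactual) pairs; an assumed estimator."""
    diffs = np.array([cf - base for base, cf in pairs])
    return diffs.mean(axis=0)

def project(x, d, M):
    """Coefficient of x along direction d under the inner product induced by M."""
    return inner(x, d, M) / inner(d, d, M)

rng = np.random.default_rng(0)
dim = 4

# Toy stand-in for word (output) vectors with very different per-axis scales.
vocab = rng.normal(size=(200, dim)) * np.array([1.0, 5.0, 0.5, 2.0])

# Assumed choice of M: inverse covariance of the toy word vectors.
M = np.linalg.inv(np.cov(vocab, rowvar=False))

# Hypothetical counterfactual pairs that differ along one hidden direction.
true_dir = np.array([1.0, 0.0, 0.0, 0.0])
bases = rng.normal(size=(20, dim))
pairs = [(b, b + true_dir + 0.01 * rng.normal(size=dim)) for b in bases]

d = concept_direction(pairs)
other = np.array([0.0, 1.0, 0.0, 0.0])

print("Euclidean cosine: ", cosine(d, other, np.eye(dim)))
print("M-weighted cosine:", cosine(d, other, M))
print("Projection of true_dir onto d:", project(true_dir, d, M))
```

With `M = np.eye(dim)` the helpers reduce to ordinary Euclidean cosine similarity and projection, so comparing the two printed cosines shows how strongly a geometric conclusion can depend on the choice of inner product.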