Recent work has argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of next-token prediction. We use this formalism to show that the next-token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.