Transformers pretrained via next-token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when the factors are conditionally independent, but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. For each hypothesis we derive precise predictions about the geometric structure of activations, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test these hypotheses on transformers trained on synthetic processes with known latent structure. Models learn factored representations when the factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low-dimensional structure may persist even in models trained on complex data.
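As an illustrative dimension count (a minimal sketch; the symbols $k$ and $m$ are our assumptions, not notation from the abstract), suppose the world decomposes into $k$ latent factors, each taking $m$ values. A product-space representation assigns a dimension to every joint state, while a factored representation allocates one orthogonal subspace per factor:

$$
\dim(\text{product}) = \prod_{i=1}^{k} m = m^{k},
\qquad
\dim(\text{factored}) = \sum_{i=1}^{k} m = km.
$$

For example, with $k = 10$ binary factors ($m = 2$), the product representation needs $2^{10} = 1024$ dimensions, whereas the factored representation needs only $10 \times 2 = 20$.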