How do transformer language models memorize factual associations? A common view casts internal weight matrices as associative memories over pairs of embeddings, requiring parameter counts that scale linearly with the number of facts. We develop a theoretical and empirical account of an alternative, \emph{geometric} form of memorization in which learned embeddings encode relational structure directly, and the MLP plays a qualitatively different role. In a controlled setting where a single-layer transformer must memorize random bijections from subjects to a shared attribute set, we prove that a logarithmic embedding dimension suffices: subject embeddings encode \emph{linear superpositions} of their associated attribute vectors, and a small MLP acts as a relation-conditioned selector that extracts the relevant attribute via ReLU gating, rather than as an associative key-value mapping. We extend these results to the multi-hop setting -- chains of relational queries such as ``Who is the mother of the wife of $x$?'' -- providing constructions with and without chain-of-thought that exhibit a provable capacity-depth tradeoff, complemented by a matching information-theoretic lower bound. Empirically, gradient descent discovers solutions with precisely the predicted structure. Once trained, the MLP transfers zero-shot to entirely new bijections when subject embeddings are appropriately re-initialized, revealing that it has learned a generic selection mechanism rather than memorizing any particular set of facts.
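To make the selector idea concrete, the following is a minimal NumPy sketch of a relation-conditioned ReLU gate. It is an illustrative toy under stated assumptions, not the construction proved in the paper. Each subject embedding stores one attribute vector per relation in a separate block, which is a simplification we assume for clarity, so the embedding dimension here grows with the number of relations and does not demonstrate the logarithmic bound. The function names (`make_facts`, `embed_subjects`, `selector_mlp`, `answer_all`) and the random sign attribute vectors are hypothetical choices made for this example.

```python
import numpy as np


def make_attribute_vectors(num_attrs, dim, rng):
    # Assumption: random +/-1 vectors stand in for learned attribute embeddings.
    return rng.choice([-1.0, 1.0], size=(num_attrs, dim))


def make_facts(num_subjects, num_relations, rng):
    # One random bijection per relation: facts[r, s] is the attribute index
    # that relation r assigns to subject s.
    return np.stack([rng.permutation(num_subjects) for _ in range(num_relations)])


def embed_subjects(facts, attr_vecs):
    # Subject s stores the attribute vector for relation r in block r.
    # Result shape: (num_subjects, num_relations, dim).
    return attr_vecs[facts].transpose(1, 0, 2)


def relu(x):
    return np.maximum(x, 0.0)


def selector_mlp(subject_emb, relation, bound=2.0):
    # Fact-independent selector: the relation shifts every non-selected block
    # by -bound, so ReLU zeroes it out. A positive and a negative ReLU branch
    # together pass the selected block through unchanged.
    # Requires |entries of subject_emb| < bound (true for +/-1 entries).
    num_relations = subject_emb.shape[0]
    gate = np.zeros((num_relations, 1))
    gate[relation] = 1.0
    shift = bound * (1.0 - gate)
    hidden_pos = relu(subject_emb - shift)
    hidden_neg = relu(-subject_emb - shift)
    return (hidden_pos - hidden_neg).sum(axis=0)


def answer_all(subject_embs, attr_vecs):
    # Decode every (relation, subject) query by nearest attribute vector
    # under the dot product.
    num_subjects, num_relations, _ = subject_embs.shape
    preds = np.empty((num_relations, num_subjects), dtype=int)
    for r in range(num_relations):
        for s in range(num_subjects):
            out = selector_mlp(subject_embs[s], r)
            preds[r, s] = int(np.argmax(attr_vecs @ out))
    return preds


rng = np.random.default_rng(0)
attr_vecs = make_attribute_vectors(num_attrs=32, dim=64, rng=rng)
facts = make_facts(num_subjects=32, num_relations=3, rng=rng)
predictions = answer_all(embed_subjects(facts, attr_vecs), attr_vecs)
```

Because `selector_mlp` never sees the bijections, the same function can be reused on a fresh set of facts simply by rebuilding the subject embeddings with `embed_subjects`. In this toy setting that mirrors the zero-shot transfer described above: the facts live in the embeddings, and the MLP only selects.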