Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of atomic-fact storage that we term geometric memory. Here, the model synthesizes embeddings that encode novel global relationships among all entities, including ones that never co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the emergence of such a geometry, as opposed to a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimization pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate that the geometry stems from a spectral bias which, in contrast to prevailing theories, arises naturally even in the absence of such pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas such as knowledge acquisition, capacity, discovery, and unlearning.
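To make the contrast concrete, the following is a minimal, hypothetical sketch (not the paper's construction): entities chained by a successor relation, where training only exposes adjacent pairs. An associative lookup must compose $\ell$ hops to answer an $\ell$-step query, whereas a stand-in 1-D "geometric" embedding, consistent with the same 1-step pairs, answers it with a single arithmetic read-out, even for entity pairs that never co-occur.

```python
# Toy illustration of associative vs. geometric storage (assumed setup, not the paper's).

N = 10
train_pairs = [(i, i + 1) for i in range(N - 1)]  # only adjacent (co-occurring) pairs are ever seen

# Associative memory: a lookup table over co-occurring pairs.
# An l-step query requires composing l separate lookups.
succ = {a: b for a, b in train_pairs}

def assoc_query(x, l):
    for _ in range(l):
        x = succ[x]          # one lookup per hop
    return x

# Geometric memory (hypothetical 1-D embedding): each entity gets a coordinate
# consistent with the 1-step pairs, so the relation between any two entities,
# including pairs never observed together, is a single arithmetic step.
coord = {e: float(e) for e in range(N)}      # stand-in for learned embeddings
entity_at = {v: k for k, v in coord.items()}

def geom_query(x, l):
    return entity_at[coord[x] + l]           # l-fold composition collapses to 1 step

assert assoc_query(2, 5) == geom_query(2, 5) == 7
```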