We explore the topology of representation manifolds arising in autoregressive neural language models trained on raw text data. To study their properties, we introduce tools from computational algebraic topology and use them as the basis for a measure of topological complexity, which we call perforation. Using this measure, we study how topological structure in GPT-based large language models evolves across depth and over the course of training. We then compare these models to gated recurrent models and show that the latter exhibit more topological complexity, with a distinct pattern of changes common to all natural languages but absent from synthetically generated data. The paper presents a detailed analysis of the representation manifolds derived by these models, based on the shapes of the vector clouds they induce when conditioned on sentences from natural language corpora. The methods developed in this paper are novel in the field and rest on mathematical apparatus that may be unfamiliar to the target audience. To help with this, we introduce the minimum necessary theory and provide additional visualizations in the appendices. The main contribution of the paper is a striking observation about the topological structure of the transformer compared to LSTM-based neural architectures. It suggests that further research into the mathematical properties of these neural networks is necessary to understand the operation of large transformer language models. We hope this work inspires further exploration in this direction within the NLP community.