A language model (LM) is a mapping from a linguistic context to an output token. However, much remains unknown about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of this phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.
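As a concrete illustration of the kind of layer-wise geometric analysis the abstract describes, the sketch below estimates the intrinsic dimensionality of token representations at each layer with the TwoNN estimator (Facco et al., 2017), a common choice for this purpose. This is a minimal, hypothetical example: the paper's actual estimator, models, datasets, and preprocessing may differ, and `hidden_states` here is stand-in random data rather than real LM activations.

```python
# Minimal TwoNN intrinsic-dimension estimator applied layer by layer.
# Assumption: hidden_states[l] is an array of shape (num_tokens, hidden_size).
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(points: np.ndarray) -> float:
    """Estimate intrinsic dimension from the ratio of each point's
    second- to first-nearest-neighbour distance (maximum-likelihood fit)."""
    tree = cKDTree(points)
    # k=3 returns distances to self, 1st, and 2nd nearest neighbours
    dists, _ = tree.query(points, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    mu = r2 / r1
    mu = mu[mu > 1.0]  # drop degenerate duplicate points
    return len(mu) / np.sum(np.log(mu))

# Stand-in data: 12 "layers" of 2000 token representations of width 768.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(2000, 768)) for _ in range(12)]

# Intrinsic-dimension profile across layers; a pronounced peak in such a
# profile is the sort of "high-dimensionality phase" referred to above.
id_profile = [twonn_id(h) for h in hidden_states]
print(id_profile)
```

In practice one would replace the stand-in arrays with hidden states extracted from a pre-trained transformer and compare the resulting per-layer profile across models and datasets.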