For a language model (LM) to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions. We propose analyzing compression in (pre-trained) LMs from two points of view: geometric and information-theoretic. We demonstrate that the two views are highly correlated, such that the intrinsic geometric dimension of linguistic data predicts their coding length under the LM. We then show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that being able to compress linguistic information is an important part of successful LM performance. As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension estimators for the first time on linguistic data, showing that only some encapsulate the relationship between information-theoretic compression, geometric compression, and ease-of-adaptation.
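To make the two views of compression concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the estimators it evaluates, so the geometric side is illustrated with a maximum-likelihood variant of the TwoNN intrinsic dimension estimator, chosen here purely as an assumption. The information-theoretic side is illustrated with a coding length computed as the summed negative log-probability of the tokens. The function names `twonn_intrinsic_dimension` and `coding_length_bits` are hypothetical, and the inputs (a matrix of representations and an array of per-token log-probabilities) are assumed to have been extracted from an LM beforehand.

```python
import numpy as np


def twonn_intrinsic_dimension(points):
    """Estimate intrinsic dimension from ratios of 2nd to 1st neighbor distances.

    Illustrative MLE variant of TwoNN: if mu_i = r2_i / r1_i follows a
    Pareto(1, d) law, the maximum-likelihood estimate is d = N / sum(log mu_i).
    `points` is an (N, D) array, e.g. one hidden-state vector per sentence.
    """
    x = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(x ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)    # clip small negative rounding errors
    np.fill_diagonal(d2, np.inf)   # a point is not its own neighbor
    # Columns 0 and 1 hold the smallest and second-smallest distance per row.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                  # drop exact duplicates to avoid division by zero
    mu = r2[keep] / r1[keep]
    return keep.sum() / np.sum(np.log(mu))


def coding_length_bits(token_log_probs):
    """Total code length in bits, given natural-log token probabilities."""
    return -np.sum(token_log_probs) / np.log(2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for LM representations: a 5-dimensional Gaussian
    # linearly embedded in 50 ambient dimensions.
    latent = rng.normal(size=(1000, 5))
    embedded = latent @ rng.normal(size=(5, 50))
    print("estimated intrinsic dimension:", twonn_intrinsic_dimension(embedded))
    print("coding length (bits):", coding_length_bits(np.log([0.5, 0.25])))
```

On the synthetic data the estimate should land near the latent dimension rather than the ambient one, which is the sense in which a dataset can occupy "relatively few dimensions" of a much larger representation space.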