For a language model (LM) to faithfully model human language, it must compress vast, potentially infinite information into relatively few dimensions. We propose analyzing compression in (pre-trained) LMs from two points of view: geometric and information-theoretic. We demonstrate that the two views are highly correlated, such that the intrinsic geometric dimension of linguistic data predicts their coding length under the LM. We then show that, in turn, high compression of a linguistic dataset predicts rapid adaptation to that dataset, confirming that being able to compress linguistic information is an important part of successful LM performance. As a practical byproduct of our analysis, we evaluate a battery of intrinsic dimension estimators for the first time on linguistic data, showing that only some encapsulate the relationship between information-theoretic compression, geometric compression, and ease-of-adaptation.