We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$ characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show directly from the data, independently of any model. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large $N$. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.
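The central quantity here, the per-character code length $\ell(N) = -\langle \log_2 P(\text{next characters} \mid \text{previous } N \text{ characters})\rangle$, can be estimated with any off-the-shelf causal language model. The following is a minimal sketch, not the authors' code: the model (gpt2), the sample file, and the chosen values of $N$ are placeholders, and probing the $N\sim 10^4$ regime discussed above would require a model with a much longer context window.

```python
"""Sketch: estimate per-character code length l(N), in bits/character,
for a few context lengths N, using an off-the-shelf causal LM.
Illustrative only; model, text source, and N values are assumptions."""
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def code_length_bits_per_char(text: str, n_context: int, n_eval: int = 256) -> float:
    """Average code length (bits/character) of the last n_eval characters
    of `text`, conditioned on the preceding n_context characters."""
    assert len(text) >= n_context + n_eval
    context, target = text[-(n_context + n_eval):-n_eval], text[-n_eval:]
    ids_ctx = tok(context, return_tensors="pt").input_ids
    ids_tgt = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([ids_ctx, ids_tgt], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position i predict the token at position i + 1
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    tgt_positions = range(ids_ctx.shape[1] - 1, ids.shape[1] - 1)
    nats = -sum(logp[i, ids[0, i + 1]].item() for i in tgt_positions)
    return nats / math.log(2) / len(target)  # convert nats/span to bits/character

sample = open("sample.txt").read()  # any long English text (placeholder path)
# keep n_context + n_eval within the model's token limit (1024 tokens for gpt2)
for N in (16, 256, 2048):
    print(N, round(code_length_bits_per_char(sample, N), 3))
```

A decreasing trend of the printed values with $N$ is the signature of long-ranged structure described above; averaging over many evaluation windows and texts would be needed for the small differences at large $N$ to be statistically meaningful.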