The entropy rate of printed English is famously estimated at about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English is nearly 80 percent redundant relative to the five bits per character expected of random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this level of redundancy. Our model describes a procedure that self-similarly segments text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of the corpus, which is captured by the only free parameter in our model.
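For concreteness, the 80 percent figure follows from Shannon's definition of redundancy applied to the two rates quoted above; the worked arithmetic below is our own illustration, using the rounded values stated in the abstract:
\[
  R \;=\; 1 - \frac{H}{H_{\max}} \;\approx\; 1 - \frac{1\ \text{bit/char}}{5\ \text{bits/char}} \;=\; 0.8,
\]
where $H$ is the estimated entropy rate of printed English and $H_{\max} = 5$ bits per character is the rate of text drawn uniformly at random from an alphabet of $2^5 = 32$ symbols.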
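The abstract only summarizes the segmentation procedure; as a purely illustrative sketch (not the paper's algorithm), a recursive, self-similar decomposition of a word sequence into nested chunks terminating at single words might look as follows. The midpoint split is a hypothetical stand-in for whatever semantic-coherence criterion the model actually uses.

# Illustrative sketch only: recursive self-similar segmentation of a
# word sequence into a nested hierarchy of chunks, down to single words.
def segment(words):
    if len(words) <= 1:          # base of the hierarchy: a one-word chunk
        return words
    mid = len(words) // 2        # stand-in split point; the real criterion is semantic
    return [segment(words[:mid]), segment(words[mid:])]

# Example: an eight-word sentence becomes a binary semantic hierarchy.
tree = segment("the entropy rate of printed English is low".split())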