Berezinskii--Kosterlitz--Thouless transition in a context-sensitive random language model

Several power-law critical properties involving different statistics in natural languages, reminiscent of the scaling properties of physical systems at or near phase transitions, have been documented for decades. The recent rise of large language models has added further evidence and excitement by revealing intriguing similarities with notions in physics such as scaling laws and emergent abilities. However, specific instances of classes of generative language models that exhibit phase transitions, as understood by the statistical physics community, are lacking. In this work, inspired by the one-dimensional Potts model in statistical physics, we construct a simple probabilistic language model that falls under the class of context-sensitive grammars, which we call the context-sensitive random language model, and numerically demonstrate an unambiguous phase transition in the framework of a natural language model. We explicitly show that a precisely defined order parameter, which captures symbol frequency biases in the sentences generated by the language model, changes from strictly zero to a strictly nonzero value (in the infinite-length limit of sentences), implying a mathematical singularity that arises when tuning the parameter of the stochastic language model we consider. Furthermore, we identify the phase transition as a variant of the Berezinskii--Kosterlitz--Thouless (BKT) transition, which is known to exhibit critical properties not only at the transition point but also throughout an entire phase. This finding raises the possibility that critical properties in natural languages may require neither careful fine-tuning nor self-organized criticality, but are instead generically explained by the underlying connection between language structures and the BKT phases.
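
To make the idea of an order parameter that captures symbol frequency biases concrete, the following is a minimal sketch only. It uses the standard Potts-model magnetization, m = (q f_max - 1)/(q - 1), where f_max is the fraction of the most frequent symbol and q is the alphabet size. This formula, the function name `frequency_bias`, and its arguments are illustrative assumptions; the abstract does not give the precise definition used in this work, which may differ.

```python
from collections import Counter


def frequency_bias(sentence, q):
    """Potts-style order parameter for a finite sentence (illustrative assumption).

    Returns 0.0 when the most frequent symbol occupies exactly 1/q of the
    sentence (no bias) and 1.0 when a single symbol fills the whole sentence.
    `q` is the assumed alphabet size.
    """
    if q < 2:
        raise ValueError("alphabet size q must be at least 2")
    if len(sentence) == 0:
        raise ValueError("sentence must be non-empty")
    counts = Counter(sentence)
    # Fraction of positions occupied by the most frequent symbol.
    max_fraction = max(counts.values()) / len(sentence)
    return (q * max_fraction - 1) / (q - 1)
```

For a two-symbol alphabet, `frequency_bias("abab", 2)` gives zero and `frequency_bias("aaaa", 2)` gives one. In a finite sample the value fluctuates, so a statement about a strictly zero or strictly nonzero value refers to the infinite-length limit.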
