We show that the laws of autocorrelation decay in texts are closely related to the applicability limits of language models. Using distributional semantics, we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics yields consistent autocorrelation decay exponents for texts translated into multiple languages. Autocorrelation decay in generated texts differs quantitatively, and often qualitatively, from that in literary texts. We conclude that language models exhibiting Markov behavior, including large autoregressive language models, may have limitations when applied to long texts, whether for analysis or generation.
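To make the measured quantity concrete, the following is a minimal sketch of how a word-level autocorrelation and its power-law exponent could be estimated. It rests on assumptions that the abstract does not state: each word is already mapped to an embedding vector, the autocorrelation at lag tau is taken as the mean cosine similarity of centered vectors tau positions apart, and the exponent comes from a least-squares line in log-log coordinates. The function names `autocorrelation` and `fit_power_law` are hypothetical and are not the authors' code.

```python
import numpy as np


def autocorrelation(vectors, max_lag):
    """Mean cosine similarity between centered word vectors `lag` positions apart.

    Assumption: `vectors` is a (n_words, dim) array of word embeddings in
    text order. This estimator is illustrative, not the paper's definition.
    """
    v = np.asarray(vectors, dtype=float)
    v = v - v.mean(axis=0)  # remove the mean vector of the text
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    v = v / np.maximum(norms, 1e-12)  # unit length, guarding against zero rows
    lags = np.arange(1, max_lag + 1)
    corr = np.array([np.mean(np.sum(v[:-k] * v[k:], axis=1)) for k in lags])
    return lags, corr


def fit_power_law(lags, corr):
    """Fit C(tau) ~ a * tau**(-alpha) by linear regression in log-log space.

    Only lags with positive correlation are used, since the logarithm is
    undefined otherwise. Returns (alpha, a).
    """
    lags = np.asarray(lags, dtype=float)
    corr = np.asarray(corr, dtype=float)
    mask = corr > 0
    slope, intercept = np.polyfit(np.log(lags[mask]), np.log(corr[mask]), 1)
    return -slope, np.exp(intercept)
```

Under these assumptions, a Markov-like process would show correlations that fall off exponentially, so a log-log plot of `corr` against `lags` would bend downward instead of following the straight line expected for a power law.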