Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic-power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
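To illustrate the link between repeated subsequences and higher-order Rényi entropies, the following is a minimal sketch for order 2 only. It relies on the standard identity that the order-2 Rényi entropy of length-n blocks is the negative logarithm of the probability that two randomly chosen blocks coincide, which can be estimated from the number of repeated blocks. The function names, the choice of order 2, and the pair-counting estimator are assumptions made for illustration; the estimator actually used in this work is not specified here, and overlapping blocks are not independent samples, so the result should be read as a rough finite-length estimate.

```python
from collections import Counter
import math


def block_counts(tokens, n):
    """Count every length-n block (overlapping windows) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def renyi2_block_entropy(tokens, n):
    """Estimate the order-2 Renyi entropy (in nats) of length-n blocks.

    Hypothetical helper: the collision probability is estimated as the
    fraction of ordered pairs of distinct positions that carry the same block.
    """
    counts = block_counts(tokens, n)
    m = sum(counts.values())  # total number of blocks
    if m < 2:
        raise ValueError("need at least two blocks of length n")
    # Ordered pairs of distinct positions sharing an identical block.
    repeated_pairs = sum(c * (c - 1) for c in counts.values())
    if repeated_pairs == 0:
        # No block repeats: the text is too short to resolve this scale.
        return math.inf
    collision_prob = repeated_pairs / (m * (m - 1))
    return -math.log(collision_prob)


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat sat on the hat".split()
    for n in range(1, 5):
        print(n, renyi2_block_entropy(text, n))
```

The estimate becomes infinite once no block of length n repeats, which is one way to see why only a limited range of block lengths is accessible for a text of finite length.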