The Statistical Signature of LLMs

Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
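
As an illustration of the kind of measurement described above, the following minimal sketch computes a compression ratio for a piece of text. It is an assumption-laden example, not the procedure used in this work: the choice of zlib (DEFLATE) as the lossless compressor, the function name `compression_ratio`, and the two toy input strings are all hypothetical and chosen only for demonstration.

```python
import random
import zlib


def compression_ratio(text: str, level: int = 9) -> float:
    """Return compressed size / raw size in UTF-8 bytes.

    Lower values mean the text is more compressible, i.e. more
    statistically regular. zlib is an illustrative choice of
    lossless compressor, not necessarily the one used in the paper.
    """
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, level)) / len(raw)


# Toy inputs (hypothetical): a highly repetitive string and a
# pseudo-random string of the same length over a small alphabet.
repetitive = "the model generates text. " * 200
rng = random.Random(0)  # fixed seed so the example is reproducible
alphabet = "abcdefghijklmnopqrstuvwxyz "
varied = "".join(rng.choice(alphabet) for _ in range(len(repetitive)))

print(f"repetitive: {compression_ratio(repetitive):.3f}")
print(f"varied:     {compression_ratio(varied):.3f}")
```

The repetitive string yields a much lower ratio than the pseudo-random one. For very short inputs the compressor's fixed format overhead can push the ratio above 1, so ratios are only meaningfully comparable between texts of similar length.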
