We propose a new asymptotic equipartition property for the perplexity of a large piece of text generated by a language model and present theoretical arguments for this property. Perplexity, defined as an inverse likelihood function, is widely used as a performance metric for training language models. Our main result states that the logarithmic perplexity of any large text produced by a language model must converge asymptotically to the average entropy of its token distributions. This means that language models are constrained to produce outputs only from a ``typical set'', which, we show, is a vanishingly small subset of all possible grammatically correct outputs. We present preliminary experimental results from an open-source language model to support our theoretical claims. This work has possible practical applications for understanding and improving ``AI detection'' tools, and theoretical implications for the uniqueness, predictability and creative potential of generative models.
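The convergence claim can be illustrated numerically with a toy stand-in for a language model. The sketch below is not the experiment reported in this work: it assumes a small random first-order Markov chain in place of a real model, takes logarithmic perplexity to be the per-token negative log-likelihood in nats, and uses hypothetical helper names (`make_toy_model`, `sample_and_measure`). Under those assumptions, it samples a long text and compares its logarithmic perplexity with the average entropy of the token distributions actually used during generation.

```python
import math
import random


def make_toy_model(vocab_size, rng):
    """Build a toy 'language model': one next-token distribution per previous token.

    This is an illustrative assumption, not the open-source model used in the paper.
    """
    model = []
    for _ in range(vocab_size):
        # Skewed random weights so the rows have noticeably different entropies.
        weights = [rng.random() ** 3 + 1e-3 for _ in range(vocab_size)]
        total = sum(weights)
        model.append([w / total for w in weights])
    return model


def sample_and_measure(model, n_tokens, rng):
    """Sample n_tokens from the toy model.

    Returns (log_perplexity, average_entropy), both in nats per token.
    """
    vocab = list(range(len(model)))
    prev = 0  # arbitrary start token
    neg_log_likelihood = 0.0
    entropy_sum = 0.0
    for _ in range(n_tokens):
        probs = model[prev]
        token = rng.choices(vocab, weights=probs, k=1)[0]
        # Contribution of the sampled token to the log-perplexity.
        neg_log_likelihood -= math.log(probs[token])
        # Entropy of the distribution the token was drawn from.
        entropy_sum -= sum(p * math.log(p) for p in probs)
        prev = token
    return neg_log_likelihood / n_tokens, entropy_sum / n_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    model = make_toy_model(vocab_size=20, rng=rng)
    for n in (10, 100, 1000, 10000, 100000):
        log_ppl, avg_entropy = sample_and_measure(model, n, rng)
        print(f"N={n:>6}  log-perplexity={log_ppl:.4f}  "
              f"avg entropy={avg_entropy:.4f}  gap={abs(log_ppl - avg_entropy):.4f}")
```

In this toy setting the gap between the two quantities should tend to shrink as the text length grows, which is the behaviour the equipartition property describes. A real language model conditions on the full context rather than on the previous token alone, so this is only a simplified picture of the claim.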