We study the problem of entropy calibration, which asks whether a language model's entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing as generations grow longer, due to error accumulation. To calibrate the model and improve text quality, it has become standard practice to truncate the distribution, but this approach reduces output diversity, which we would like to avoid. Therefore, in this paper, we ask: does miscalibration improve automatically with scale, and if not, is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the rate of scaling depends on the power law exponent of the data distribution -- in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted theoretically: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as from smaller ones, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
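As a toy illustration (not part of the paper's method), the calibration gap the abstract describes can be sketched as the difference between a model's per-step entropy and its per-step log loss on human tokens. The distribution `p` and the token stream below are made up for the example:

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def log_loss(p, human_token):
    """Negative log-probability the model assigns to the human token."""
    return -math.log(p[human_token])

# Toy next-token distribution over a 4-token vocabulary (hypothetical).
p = [0.7, 0.2, 0.05, 0.05]

# Entropy of the model's own samples, in expectation, at this step.
h = entropy(p)

# Average log loss over a toy stream of "human" tokens.
human_tokens = [0, 0, 1, 0, 2]
avg_ll = sum(log_loss(p, t) for t in human_tokens) / len(human_tokens)

# Calibration gap: zero for a calibrated model; here the model is
# over-confident relative to the human stream, so the gap is negative.
gap = h - avg_ll
print(f"entropy={h:.3f} nats, log loss={avg_ll:.3f} nats, gap={gap:+.3f}")
```

A calibrated model would have `gap` near zero on average; truncation (e.g. top-p sampling) lowers the entropy term directly, but as the abstract notes, it does so at the cost of the log loss term.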