Long-context language models have drawn considerable attention in the past few years. Prior work on how long context affects language model performance points in two directions: some studies find that long irrelevant context can harm performance, while others empirically summarize the loss reduction from relevant long context as scaling laws. This calls for a more thorough understanding of how long context affects language modeling. In this work, we (1) propose using "Intrinsic Entropy" to explain the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data that validate our theoretical assumptions and deductions. Our theoretical framework yields practical insights, such as establishing that training dataset size dictates an optimal context length and, in certain cases, bounds context length scaling. We hope our work may inspire new long-context language models, as well as future work studying the physics of language models.