We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillations between positive and negative gradient alignment of cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT also improves OOD generalization by 35% on an arithmetic multiplication pretraining task. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
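To make the objective named above concrete, the IBLM problem can be written, under the assumption that prediction performance is expressed through cross-entropy, as

\[
\min_{\theta} \; H(R_\theta) \quad \text{subject to} \quad \mathcal{L}_{\mathrm{CE}}(\theta) \le \mathcal{L}_{\mathrm{CE}}^{*},
\]

where $R_\theta$ denotes the model's internal representations, $H$ their entropy (measured via MBE), and $\mathcal{L}_{\mathrm{CE}}^{*}$ the best attainable cross-entropy; the exact form of the constraint here is an illustrative assumption, not the paper's definition.

The sketch below is likewise a minimal, non-authoritative illustration of two quantities used above: a matrix-based entropy computed from the trace-normalized Gram matrix of a batch of token representations, and the cross-entropy/MBE gradient alignment whose sign oscillation is reported. The specific kernel, normalization, and entropy order used in the paper are assumptions.

```python
import torch
import torch.nn.functional as F

def matrix_based_entropy(reps: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Entropy of the trace-normalized Gram matrix of token representations.

    reps: (n, d) matrix whose rows are token representations.
    alpha == 1 gives the von Neumann (Shannon) limit; other alpha values give
    Renyi-style matrix entropies. The exact variant used in the paper is assumed.
    """
    gram = reps @ reps.T                       # (n, n) Gram matrix over tokens
    gram = gram / gram.trace()                 # normalize so eigenvalues sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
    if alpha == 1.0:
        return -(eigvals * eigvals.log()).sum()
    return eigvals.pow(alpha).sum().log() / (1.0 - alpha)

def gradient_alignment(grad_ce: torch.Tensor, grad_mbe: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of flattened gradients: positive means the cross-entropy
    and MBE objectives currently pull parameters in the same direction, negative
    means they conflict (the oscillation described in the abstract)."""
    return F.cosine_similarity(grad_ce.flatten(), grad_mbe.flatten(), dim=0)
```

A gating rule in the spirit of GAPT could, for example, switch from a memorization phase to a compression phase when this alignment turns negative, though the actual switching criterion is specified in the paper rather than here.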