Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that parameter initialization is a gene-like determinant of training and, in particular, of model capacity. Reducing the initialization scale consistently improves pretraining, with the largest gains on reasoning-demanding tasks. We identify two widely used empirical settings that limit the advantage of small initialization, and show how relaxing them restores favorable scaling. We further uncover a critical initialization that balances reasoning and training. Mechanistically, small initialization drives a distinct developmental trajectory: parameters first condense into low-complexity structures and later expand into richer representations, giving concrete form to the idea that compression is intelligence. Token-level analyses show that the gains concentrate on non-trivial, context-constrained predictions rather than being spread uniformly across all tokens. These results motivate a simple $\gamma$-initialization rule: expose the initialization range as an explicit knob and use small initialization by default, an almost cost-free intervention that improves pretraining and strengthens reasoning across model scales.
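The abstract does not define $\gamma$, so the following is only a minimal sketch of what "exposing the initialization range as an explicit knob" could look like. It assumes, purely for illustration, that weights are drawn from a zero-mean Gaussian with standard deviation $d_\text{in}^{-\gamma}$, so that $\gamma = 0.5$ recovers the familiar $1/\sqrt{d_\text{in}}$ scaling and larger $\gamma$ gives a smaller initialization. The function names `gamma_init` and `empirical_std` are hypothetical and are not taken from the paper.

```python
import random
import statistics


def gamma_init(d_in, d_out, gamma=0.5, seed=0):
    """Return a d_in x d_out weight matrix as nested lists.

    ASSUMPTION (not from the paper): entries are i.i.d. Gaussian with
    mean 0 and standard deviation d_in ** (-gamma).
    """
    rng = random.Random(seed)
    std = d_in ** (-gamma)
    return [[rng.gauss(0.0, std) for _ in range(d_out)] for _ in range(d_in)]


def empirical_std(weights):
    """Population standard deviation over all entries of the matrix."""
    return statistics.pstdev([x for row in weights for x in row])


if __name__ == "__main__":
    d_in, d_out = 256, 256
    # Larger gamma means a smaller initialization scale under this assumption.
    for gamma in (0.5, 0.8, 1.0):
        w = gamma_init(d_in, d_out, gamma=gamma)
        target = d_in ** (-gamma)
        print(f"gamma={gamma}: target std={target:.5f}, "
              f"empirical std={empirical_std(w):.5f}")
```

Under this assumed parameterization, sweeping `gamma` changes only the scale of the initial weights and leaves the architecture and data untouched, which is the sense in which the abstract calls the intervention almost cost-free.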