Small language models (SLMs) offer an efficient and accessible alternative to large language models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered via evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 5.16x and 1.26x fewer floating-point operations at token budgets of 10B and 100B, respectively. We release all code publicly, offering a practical and reproducible path toward cost-efficient small language model development at scale.