Improvements in language model capabilities are often attributed to increasing model size or training data, but in some cases smaller models trained on curated data or with different architectural decisions can outperform larger ones trained on more tokens. What accounts for this? To quantify the impact of these design choices, we meta-analyze 92 open-source pretrained models across a wide array of scales, including state-of-the-art open-weights models as well as less performant models and those with less conventional design decisions. We find that by incorporating features besides model size and number of training tokens, we achieve a relative 3-28\% improvement in our ability to predict downstream performance compared with using scale alone. Analysis of model design decisions reveals insights into data composition, such as a trade-off between language and code task performance around 15-25\% code in the training mix, as well as the stronger performance of certain architectural choices, such as rotary rather than learned positional embeddings. Broadly, our framework lays a foundation for more systematic investigation of how model development choices shape final capabilities.
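To make the scale-only versus feature-augmented comparison concrete, here is a minimal sketch of the kind of analysis described above: fit one predictor of a downstream score from scale features alone, another from scale plus design features, and compare cross-validated predictive ability. The CSV file, column names, and choice of regressor are hypothetical illustrations, not the paper's actual feature set or model.

```python
# Sketch: does adding design features beyond scale improve prediction of a
# downstream benchmark score? All file and column names below are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical metadata table: one row per open-source pretrained model.
df = pd.read_csv("model_metadata.csv")

scale_cols = ["log_params", "log_train_tokens"]           # scale-only baseline
extra_cols = ["pct_code_data", "uses_rotary_embeddings"]  # example design features
target = "benchmark_score"

def cv_r2(feature_cols):
    """Cross-validated R^2 of a regressor predicting the downstream score."""
    X, y = df[feature_cols].to_numpy(), df[target].to_numpy()
    model = GradientBoostingRegressor(random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

r2_scale = cv_r2(scale_cols)                  # predict from scale alone
r2_full = cv_r2(scale_cols + extra_cols)      # predict from scale + design features

print(f"scale only R^2:              {r2_scale:.3f}")
print(f"scale + design features R^2: {r2_full:.3f}")
print(f"relative improvement:        {(r2_full - r2_scale) / abs(r2_scale):.1%}")
```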