Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading off parameter count against dataset size, alongside a growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge-based skills, such as knowledge-based QA, and reasoning-based skills, such as code generation, and we answer this question in the affirmative: $\textbf{scaling laws are skill-dependent}$. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation over different datamixes and find that, even after correcting for datamix differences, $\textbf{knowledge and code exhibit fundamental differences in scaling behaviour}$. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that $\textbf{a misspecified validation set can shift the compute-optimal parameter count by nearly 50%}$, depending on its skill composition.
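To make the compute-optimal trade-off concrete, the sketch below assumes a standard Chinchilla-style loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ under a compute budget $C \approx 6ND$ FLOPs, and solves for the loss-minimising parameter count $N^*$ in closed form. All coefficient values are purely illustrative (not fitted to any data in this work); the two settings merely mimic how skill-dependent exponents, e.g. from a knowledge-heavy versus a code-heavy validation set, would shift the compute-optimal allocation.

```python
import math

def compute_optimal_n(C, A, B, alpha, beta):
    """Compute-optimal parameter count for a Chinchilla-style loss
    L(N, D) = E + A/N**alpha + B/D**beta under the constraint C = 6*N*D.
    Substituting D = C/(6N) and setting dL/dN = 0 gives:
        N* = [ (alpha*A / (beta*B)) * (C/6)**beta ]**(1 / (alpha + beta))
    """
    return ((alpha * A) / (beta * B) * (C / 6) ** beta) ** (1 / (alpha + beta))

# Hypothetical, illustrative coefficients for two skill-specific
# validation losses (NOT fitted values from the paper).
C = 1e21  # training compute budget in FLOPs

n_skill_a = compute_optimal_n(C, A=400, B=410, alpha=0.34, beta=0.28)
n_skill_b = compute_optimal_n(C, A=400, B=410, alpha=0.28, beta=0.34)

# Swapping the two exponents alone changes the optimal model size
# by several-fold at a fixed budget:
print(f"optimal N under exponents (0.34, 0.28): {n_skill_a:.3g}")
print(f"optimal N under exponents (0.28, 0.34): {n_skill_b:.3g}")
```

The point of the closed form is that $N^*$ depends on the fitted exponents, so a validation set whose skill mix changes $\alpha$ and $\beta$ changes the "compute-optimal" model size even at the same budget.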