Scaling the number of parameters and the size of the training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet as these models grow more powerful and more widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how three key architectural factors, namely hidden size, the mlp-to-attention ratio (the allocation of parameters between MLP and attention), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2.
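For concreteness, the Chinchilla framework models loss as a function of parameter count $N$ and training-token count $D$. A minimal sketch of how such a law could be conditioned on an architecture descriptor $c$ (hidden size, mlp-to-attention ratio, GQA group count) is given below; the conditional coefficient $A(c)$ is an illustrative assumption, not necessarily the exact form fitted in this work.

\begin{align}
  % Chinchilla: loss as a function of parameters N and tokens D (Hoffmann et al., 2022)
  L(N, D) &= E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \\
  % Conditional sketch (assumption): a fitted coefficient depends on an
  % architecture descriptor c = (hidden size, mlp-to-attention ratio, GQA groups)
  L(N, D, c) &= E + \frac{A(c)}{N^{\alpha}} + \frac{B}{D^{\beta}}
\end{align}

Under such a form, the same $(N, D)$ budget yields different predicted losses across architectures, which is what allows a search procedure to trade predicted accuracy against measured inference throughput.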