Scaling laws for large language models (LLMs) predict model performance as a function of quantities such as model size and training data. However, differences in training configurations and data processing across model families lead to significant variation in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, fitting family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources such as model size and training tokens, but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks from the Open LLM Leaderboard v1 and v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behavior on downstream tasks such as coding and emotional intelligence.
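To make the modeling idea concrete, below is a minimal, hypothetical sketch of a latent-skill scaling law of the kind the abstract describes: compute features (log model size, log training tokens) map linearly to a low-dimensional skill vector with a family-specific offset, and benchmark accuracies are read out from the skills through a sigmoid link. The function name `sloth_predict`, the sigmoid link, the array shapes, and every numeric value are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(z):
    """Logistic link mapping real-valued scores to (0, 1) accuracies."""
    return 1.0 / (1.0 + np.exp(-z))

def sloth_predict(log_size, log_tokens, family_shift, W, Lam, b):
    """Predict benchmark accuracies from compute via low-dimensional skills.

    log_size, log_tokens : scalars, log model parameters / log training tokens
    family_shift : (k,) family-specific skill-efficiency offsets
    W            : (k, 2) shared slopes mapping compute features to skills
    Lam          : (m, k) loadings of m benchmarks on the k latent skills
    b            : (m,) benchmark-specific intercepts
    """
    compute = np.array([log_size, log_tokens])
    skills = W @ compute + family_shift   # low-dimensional latent skills
    return sigmoid(Lam @ skills + b)      # predicted accuracy on m benchmarks

# Illustrative call with 2 skills and 3 benchmarks; all numbers are made up.
W = np.array([[0.4, 0.2],
              [0.1, 0.5]])
Lam = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.3]])
b = np.array([-16.0, -18.0, -17.0])
family_shift = np.array([0.3, -0.1])  # e.g., one family's data-quality edge
print(sloth_predict(np.log(7e9), np.log(2e12), family_shift, W, Lam, b))
```

Because the slopes `W` and loadings `Lam` are shared across families while only `family_shift` varies, a model family in this sketch can be fit from very few trained models, which is the efficiency gain the abstract highlights.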