Research on scaling large language models (LLMs) has primarily focused on model parameters and training data size, overlooking the role of vocabulary size. We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We propose three complementary approaches for predicting the compute-optimal vocabulary size: IsoFLOPs analysis, derivative estimation, and parametric fit of the loss function. Our approaches converge on the conclusion that the optimal vocabulary size depends on the compute budget, with larger models requiring larger vocabularies. Most LLMs, however, use insufficient vocabulary sizes. For example, we predict that the optimal vocabulary size of Llama2-70B should have been at least 216K, 7 times larger than its vocabulary of 32K. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. By increasing the vocabulary size from the conventional 32K to 43K, we improve performance on ARC-Challenge from 29.1 to 32.0 with the same 2.3e21 FLOPs. Our work highlights the importance of jointly considering tokenization and model scaling for efficient pre-training. The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
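The core finding above — that the compute-optimal vocabulary size grows with the compute budget — can be illustrated with a minimal power-law fit in log-log space. This is a hypothetical sketch, not the paper's actual parametric loss fit: the `flops`/`vocab` data points and the function `predict_optimal_vocab` are invented for illustration only.

```python
import numpy as np

# Hypothetical (compute budget in FLOPs, optimal vocabulary size) pairs.
# These numbers are made up to illustrate the qualitative trend that the
# optimal vocabulary grows with compute; they are NOT the paper's data.
flops = np.array([1e19, 1e20, 1e21, 1e22])
vocab = np.array([16e3, 32e3, 64e3, 128e3])

# Fit a power law V_opt = a * C^gamma by linear regression in log space.
gamma, log_a = np.polyfit(np.log(flops), np.log(vocab), 1)

def predict_optimal_vocab(compute_flops):
    """Extrapolate the fitted power law to a new compute budget."""
    return np.exp(log_a) * compute_flops ** gamma

print(f"fitted exponent gamma = {gamma:.3f}")
print(f"predicted optimal vocab at 2.3e21 FLOPs: {int(predict_optimal_vocab(2.3e21))}")
```

With these synthetic points the fit recovers an exponent of about 0.30 (vocabulary doubling per tenfold increase in compute), and extrapolation to any compute budget follows directly from the fitted line. The paper's actual predictions come from its three complementary estimation methods, not this toy fit.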