This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide guidance on choosing the vocabulary size. Experimental results show that larger vocabularies lead to better LLM performance. We also consider a continual training scenario in which a pre-trained language model is further trained on a different target language, and we introduce a simple method for replacing the pre-defined vocabulary with a new one. We show that the model using the new vocabulary outperforms the model that keeps the vocabulary used in pre-training.
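The abstract does not specify the vocabulary-replacement procedure; the sketch below is one plausible illustration, not necessarily the authors' method. It uses the Hugging Face transformers API to train a new subword tokenizer on a target-language corpus and reinitialize the model's embedding matrix before continual training. The model name ("gpt2"), corpus path, and vocabulary size are illustrative assumptions.

```python
# Hedged sketch: swapping in a new target-language vocabulary before
# continual training. Model name, corpus path, and vocabulary size are
# illustrative assumptions, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained model and its original tokenizer.
model = AutoModelForCausalLM.from_pretrained("gpt2")
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

def corpus_iterator():
    # Stream the target-language corpus line by line (hypothetical path).
    with open("target_language_corpus.txt", encoding="utf-8") as f:
        for line in f:
            yield line

# Train a new subword tokenizer on the target-language corpus;
# train_new_from_iterator reuses the original tokenizer's algorithm
# (e.g., BPE for GPT-2) with a freshly learned vocabulary.
new_vocab_size = 50_000  # assumed target vocabulary size
new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=new_vocab_size
)

# Resize the embedding matrix (and the tied output layer, for GPT-2)
# to match the new vocabulary size.
model.resize_token_embeddings(len(new_tokenizer))

# The new vocabulary's token IDs do not correspond to the old ones, so
# reinitialize all embedding rows rather than keeping stale vectors;
# continual training on the target language then learns them from scratch.
with torch.no_grad():
    model.get_input_embeddings().weight.normal_(mean=0.0, std=0.02)
```

Under this reading, the transformer body retains its pre-trained weights while only the vocabulary-dependent layers are relearned, which is one way a new vocabulary could be substituted without discarding the pre-training.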