We explore the impact of pre-training data composition on the performance of small language models in a sample-efficient setting. Using datasets limited to 10 million words, we evaluate several dataset sources, including child-directed speech (CHILDES), classic books (Gutenberg), synthetic data (TinyStories), and a mix of these (Mix), across model sizes ranging from 18 million to 705 million parameters. Our experiments show that larger models (e.g., GPT2-97M, GPT2-705M, Llama-360M) perform better when trained on more complex and rich datasets like Gutenberg. Models trained on the CHILDES and TinyStories datasets underperformed across all model sizes. These findings suggest that the optimal dataset for sample-efficient training depends on the model size, and that neither child-directed speech nor simplified stories is optimal for language models of all sizes. We highlight the importance of considering both dataset composition and model capacity for effective sample-efficient language model training.