Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates how tokenizer quality varies with training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as data size increases, indicating a practical limit beyond which further scaling of the training data yields little improvement in tokenization quality. We analyze this phenomenon and attribute the saturation effect to constraints imposed by the pre-tokenization stage. These results offer practical guidance for optimizing the tokenization process and highlight avenues for future research on tokenization algorithms.
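To make the pre-tokenization constraint concrete, the following is a minimal sketch, not taken from the paper: it assumes a simplified regex pre-tokenizer (real tokenizers such as GPT-2's use a richer Unicode-aware pattern via the `regex` package) and shows why subword merges are confined to a bounded pool of pre-token types, one plausible intuition for the saturation effect.

```python
import re
from collections import Counter

# Simplified pre-tokenizer: split text into word-like chunks before any
# subword learning takes place. The pattern here is illustrative only.
PRETOKEN_RE = re.compile(r"\s*\w+|\s*[^\w\s]+")

def pretokenize(text: str) -> list[str]:
    """Split text into pre-tokens; subword merges never cross these boundaries."""
    return PRETOKEN_RE.findall(text)

corpus = "Tokenization is often assumed to benefit from more data. More data!"
pretoken_counts = Counter(pretokenize(corpus))
print(pretoken_counts)

# Because a BPE-style trainer learns merges *within* pre-tokens, once the
# corpus covers the frequent pre-token types, additional training data
# contributes few new merge candidates, so tokenizer quality saturates.
```

Under this assumption, growing the corpus past the point where frequent pre-token types are well covered mostly re-weights existing merge candidates rather than introducing new ones, which is consistent with the diminishing returns reported above.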