Packing and shuffling tokens is a common practice in training auto-regressive language models (LMs) to prevent overfitting and improve efficiency. Typically, documents are concatenated and split into chunks of the maximum sequence length (MSL), then shuffled. However, setting the atom size (the length of each data chunk subject to random shuffling) to the MSL can cause contextual incoherence, since tokens from different documents are packed into the same chunk. An alternative is padding, another common data packing strategy, which avoids contextual incoherence by placing only one document in each shuffled chunk. To optimize both packing strategies (concatenation vs. padding), we investigated the optimal atom size for shuffling and compared the two strategies' performance and efficiency. We found that matching the atom size to the MSL optimizes performance for both packing methods, and that padding yields lower final perplexity (i.e., better performance) than concatenation, at the cost of more training steps and lower compute efficiency. This trade-off informs the choice of packing method when training language models.
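The two packing strategies compared above can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: the function names, the pad token id, and the toy MSL are all assumptions made here for clarity.

```python
import random

MSL = 8     # toy maximum sequence length (illustrative value)
PAD_ID = 0  # hypothetical padding token id

def pack_concat(docs, msl=MSL):
    """Concatenation packing: join all documents into one token stream,
    then cut it into fixed-size chunks. Chunk boundaries ignore document
    boundaries, so one chunk may mix tokens from different documents
    (the contextual-incoherence issue described above)."""
    stream = [tok for doc in docs for tok in doc]
    return [stream[i:i + msl] for i in range(0, len(stream), msl)]

def pack_pad(docs, msl=MSL):
    """Padding packing: each chunk holds tokens from a single document,
    padded up to msl. Avoids cross-document mixing at the cost of
    padding waste (and hence more training steps for the same data)."""
    chunks = []
    for doc in docs:
        for i in range(0, len(doc), msl):
            piece = doc[i:i + msl]
            chunks.append(piece + [PAD_ID] * (msl - len(piece)))
    return chunks

docs = [[1, 2, 3], [4, 5, 6, 7, 8, 9, 10], [11, 12]]
concat_chunks = pack_concat(docs)
pad_chunks = pack_pad(docs)
# In both schemes the atom size equals the chunk length, so shuffling
# at atom size == MSL means shuffling whole chunks:
random.shuffle(concat_chunks)
random.shuffle(pad_chunks)
```

Note that in the concatenation variant the first chunk spans two documents, whereas in the padding variant every chunk contains tokens from exactly one document.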