Large language models (LLMs) have demonstrated exceptional performance across various natural language processing tasks. However, the conventional fixed-length data composition strategy for pretraining, which concatenates documents and splits them into fixed-length sequences, can introduce noise and limit the model's ability to capture long-range dependencies. To address this, we first introduce three metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. We further propose a multi-bucket data composition method that moves beyond the fixed-length paradigm, offering a more flexible and efficient approach to pretraining. Extensive experiments demonstrate that the proposed method significantly improves both the efficiency and efficacy of LLM pretraining. Our approach not only reduces noise and preserves context but also accelerates training, making it a promising solution for LLM pretraining.
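The abstract names the three metrics but does not define them, so the following is only a minimal sketch under assumed definitions: padding ratio as the share of token slots filled with padding, truncation ratio as the share of documents split across more than one sequence, and concatenation ratio as the share of sequences holding more than one document. The function name `composition_metrics`, the input layout, and the toy documents are all hypothetical and may differ from the paper's formal definitions.

```python
from collections import Counter


def composition_metrics(sequences):
    """Compute three data-composition metrics under ASSUMED definitions.

    sequences: list of (capacity, segments) pairs, where capacity is the
    sequence length in tokens and segments is a list of (doc_id, n_tokens)
    pieces packed into that sequence. Unused slots count as padding.
    """
    if not sequences:
        raise ValueError("need at least one sequence")

    total_slots = 0
    used_tokens = 0
    sequences_per_doc = Counter()  # how many sequences each document touches
    multi_doc_sequences = 0

    for capacity, segments in sequences:
        seq_tokens = sum(n for _, n in segments)
        if seq_tokens > capacity:
            raise ValueError("segments exceed sequence capacity")
        total_slots += capacity
        used_tokens += seq_tokens

        doc_ids = {doc_id for doc_id, _ in segments}
        for doc_id in doc_ids:
            sequences_per_doc[doc_id] += 1
        if len(doc_ids) > 1:
            multi_doc_sequences += 1

    if not sequences_per_doc:
        raise ValueError("need at least one document")

    split_docs = sum(1 for c in sequences_per_doc.values() if c > 1)
    return {
        # assumed: padded slots / all slots
        "padding_ratio": (total_slots - used_tokens) / total_slots,
        # assumed: documents split across sequences / all documents
        "truncation_ratio": split_docs / len(sequences_per_doc),
        # assumed: sequences with more than one document / all sequences
        "concatenation_ratio": multi_doc_sequences / len(sequences),
    }


# Toy documents: A (5 tokens), B (6 tokens), C (3 tokens).
# Fixed-length concat-and-split with sequence length 8: B is cut in two.
fixed = [
    (8, [("A", 5), ("B", 3)]),
    (8, [("B", 3), ("C", 3)]),
]

# Hypothetical variable-length layout with buckets of 4 and 8 tokens, where
# each document goes alone into the smallest bucket that fits it.
bucketed = [
    (8, [("A", 5)]),
    (8, [("B", 6)]),
    (4, [("C", 3)]),
]

fixed_metrics = composition_metrics(fixed)
bucketed_metrics = composition_metrics(bucketed)
print(fixed_metrics)
print(bucketed_metrics)
```

In this toy case the fixed-length layout has little padding but splits one of the three documents and mixes documents in every sequence, while the bucketed layout avoids both at the cost of more padding. The numbers only illustrate the trade-off the three metrics are meant to expose; they are not results from the paper.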