The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks discarding valuable information and ignores the varying degrees of duplication. To address this, we propose a soft deduplication method that maintains dataset integrity while selectively reducing the sampling weight of data with high commonness. Central to our approach is the concept of "data commonness", a metric we introduce to quantify the degree of duplication by measuring the occurrence probabilities of samples under an n-gram model. Empirical analysis shows that this method significantly improves training efficiency, achieving comparable perplexity scores with at least a 26% reduction in required training steps. It also improves average few-shot downstream accuracy by 1.77% when models are trained for an equivalent duration. Importantly, the approach consistently improves performance even on rigorously deduplicated datasets, indicating its potential to complement existing methods and become a standard step in LLM pre-training.
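To make the idea concrete, below is a minimal sketch of the pipeline the abstract describes: score each document's "data commonness" with an n-gram model, then downweight high-commonness documents when sampling rather than removing them. The n-gram order, the add-alpha smoothing, the use of mean n-gram log-probability as the commonness score, and the exponential downweighting rule are all illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of soft deduplication via "data commonness".
# Assumptions (not specified in the abstract): the n-gram order, add-alpha
# smoothing, mean n-gram log-probability as the commonness score, and the
# exp(-score/T) downweighting rule are illustrative choices only.
import math
import random
from collections import Counter

def train_ngram_counts(corpus, n=3):
    """Count n-grams and their (n-1)-gram contexts over whitespace tokens."""
    ngrams, contexts = Counter(), Counter()
    for doc in corpus:
        toks = doc.split()
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
            contexts[tuple(toks[i:i + n - 1])] += 1
    return ngrams, contexts

def commonness(doc, ngrams, contexts, n=3, alpha=1.0):
    """Mean log-probability of the doc's n-grams under an add-alpha model.

    Higher values mean the doc's n-grams are frequent in the corpus,
    i.e. the doc has high data commonness."""
    toks = doc.split()
    vocab = len(ngrams) + 1  # crude smoothing denominator (an assumption)
    logps = []
    for i in range(len(toks) - n + 1):
        num = ngrams[tuple(toks[i:i + n])] + alpha
        den = contexts[tuple(toks[i:i + n - 1])] + alpha * vocab
        logps.append(math.log(num / den))
    return sum(logps) / max(len(logps), 1)

def sampling_weights(scores, temperature=1.0):
    """Map commonness scores to normalized sampling weights.

    exp(-score/T) is one monotone-decreasing choice: more common -> lower
    weight. The abstract only says high-commonness data is downweighted."""
    w = [math.exp(-s / temperature) for s in scores]
    total = sum(w)
    return [x / total for x in w]

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat sat on the mat",  # duplicated document
        "a completely novel sentence appears exactly once",
    ]
    ngrams, contexts = train_ngram_counts(corpus)
    scores = [commonness(d, ngrams, contexts) for d in corpus]
    weights = sampling_weights(scores)
    for d, s, w in zip(corpus, scores, weights):
        print(f"weight={w:.3f}  commonness={s:.2f}  {d!r}")
    # Draw training samples according to the soft weights (no hard removal).
    batch = random.choices(corpus, weights=weights, k=4)
```

In a real pipeline the n-gram model would be fit on the full corpus (or a large sample of it) and the resulting weights consumed by the data loader; the essential property is that duplicated or near-duplicated documents are sampled less often rather than dropped outright.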