Data plays a fundamental role in the training of Large Language Models (LLMs). While the collection and composition of datasets have received attention, how to sample data during training remains an open question. Most LLMs are trained with a simple strategy, random sampling. However, random sampling ignores the imbalance in the training data distribution and can therefore be sub-optimal. In this paper, we propose ClusterClip Sampling to balance the text distribution of training data for better model training. Specifically, ClusterClip Sampling uses data clustering to capture the data distribution of the training set and, based on the clustering results, balances common and rare samples during training. A repetition clip operation is introduced to mitigate the overfitting caused by repeatedly drawing samples from certain clusters. Extensive experiments validate the effectiveness of ClusterClip Sampling, which outperforms random sampling and other cluster-based sampling variants across various training datasets and large language models.
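The sampling idea described above can be illustrated with a minimal sketch. The abstract does not specify the exact procedure, so everything below is an assumption made for illustration: clusters are chosen uniformly at random, a member of the chosen cluster is chosen uniformly, and a sample is dropped from its cluster once it has been drawn `max_repeats` times (the "clip"). The function name `cluster_clip_sample` and its parameters are hypothetical, and the cluster assignments are taken as given rather than computed.

```python
import random
from collections import Counter, defaultdict


def cluster_clip_sample(cluster_ids, num_draws, max_repeats, seed=0):
    """Illustrative cluster-balanced sampling with a repetition clip.

    cluster_ids[i] is the (precomputed) cluster of sample i.
    Assumed scheme, not taken from the paper: pick a cluster uniformly,
    then pick one of its remaining members uniformly.
    """
    rng = random.Random(seed)

    # Group sample indices by cluster.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    active = list(members)  # clusters that still have eligible samples
    counts = Counter()      # how often each sample has been drawn
    drawn = []

    while len(drawn) < num_draws and active:
        cid = rng.choice(active)
        pool = members[cid]
        pos = rng.randrange(len(pool))
        idx = pool[pos]
        drawn.append(idx)
        counts[idx] += 1

        # Clip: a sample that reached the repetition limit is removed.
        if counts[idx] >= max_repeats:
            pool[pos] = pool[-1]
            pool.pop()
            if not pool:
                active.remove(cid)  # cluster exhausted

    return drawn, counts


if __name__ == "__main__":
    # Toy data: 90 "common" samples in cluster 0, 10 "rare" ones in cluster 1.
    toy_clusters = [0] * 90 + [1] * 10
    order, seen = cluster_clip_sample(toy_clusters, num_draws=60, max_repeats=3)
    rare_draws = sum(1 for i in order if toy_clusters[i] == 1)
    print("draws from rare cluster:", rare_draws)
    print("max repetitions of any sample:", max(seen.values()))
```

In this toy setting, uniform random sampling would draw from the rare cluster about 10% of the time, whereas uniform cluster selection draws from it far more often. The clip bounds how many times any single rare sample can be repeated, which is the overfitting concern the abstract mentions.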