Data is the cornerstone of large language models (LLMs), but not all data is useful for model learning. Carefully selected data can better elicit the capabilities of LLMs with much less computational overhead. Most data selection methods focus on evaluating the quality of individual samples, while the combinatorial effects among samples are neglected. Even if every sample is of perfect quality, their combination may still be suboptimal for teaching LLMs due to intrinsic homogeneity or contradiction. In this paper, we aim to uncover the underlying relationship between LLM performance and data selection. Inspired by the information-compression nature of LLMs, we derive an ``entropy law'' that connects LLM performance with the data compression ratio and the first-epoch training loss, which reflect the information redundancy of a dataset and the model's mastery of the inherent knowledge encoded in it, respectively. Through both theoretical deduction and empirical evaluation, we find that model performance is negatively correlated with the compression ratio of the training data, which in turn usually yields a lower training loss. Guided by the entropy law, we propose a highly efficient and universal data selection method named \textbf{ZIP} for training LLMs, which prioritizes data subsets exhibiting a low compression ratio. Using a multi-stage algorithm that selects diverse data in a greedy manner, ZIP obtains a data subset with satisfactory diversity. Extensive experiments validate the entropy law and the superiority of ZIP across different LLM backbones and alignment stages. We also present an interesting application of the entropy law: detecting potential performance risks at the beginning of model training.
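The compression-ratio-driven greedy selection described above can be sketched as follows. This is a minimal single-stage approximation using Python's standard `zlib` compressor, not the paper's full multi-stage ZIP pipeline; the names `zip_select` and `candidate_width` are hypothetical, and the ratio is taken as raw size over compressed size, so higher values indicate more redundancy.

```python
import zlib

def compression_ratio(texts):
    """Raw size / compressed size of the concatenated texts.

    Higher values mean the subset is more compressible, i.e. more
    redundant; diverse, information-dense subsets stay close to 1.
    """
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def zip_select(pool, k, candidate_width=100):
    """Greedily pick k samples that keep the subset's compression ratio low.

    A hypothetical single-stage sketch: at each step, score every candidate
    by the compression ratio of the current selection plus that candidate,
    and keep the one that increases redundancy the least.
    """
    selected = []
    remaining = list(pool)
    while remaining and len(selected) < k:
        # Scoring is limited to a window of candidates to bound cost,
        # loosely mirroring the paper's staged narrowing of the pool.
        best = min(remaining[:candidate_width],
                   key=lambda s: compression_ratio(selected + [s]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because a duplicate of an already-selected sample compresses almost for free, adding it sharply raises the subset's ratio, so the greedy step naturally skips near-duplicates and favors diverse text.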