Data selection is of great significance in pre-training large language models, given the variation in quality within large-scale available training corpora. To this end, researchers are currently investigating the use of data influence to measure the importance of data instances, \textit{i.e.}, a high influence score indicates that incorporating an instance into the training set is likely to enhance model performance. Consequently, they select the top-$k$ instances with the highest scores. However, this approach has several limitations. (1) Computing the influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pre-trained model's ability to generalize effectively to various downstream tasks. In this paper, we introduce \texttt{Quad}, a data selection approach that considers both quality and diversity by using data influence, achieving state-of-the-art pre-training results. In particular, noting that attention layers capture extensive semantic details, we adapt accelerated $iHVP$ (inverse-Hessian-vector-product) computation methods to attention layers, enhancing our ability to evaluate the influence of data, \textit{i.e.}, its quality. For diversity, \texttt{Quad} clusters the dataset so that instances are similar within each cluster and diverse across clusters. When we opt to select data from a cluster, we evaluate the influence of only a sample of its instances, rather than processing all of them. To determine which clusters to select, we utilize the classic Multi-Armed Bandit method, treating each cluster as an arm. This approach favors clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby striking a good balance between quality and diversity.
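The cluster-as-arm idea above can be sketched with a standard UCB1 bandit. This is a minimal illustration, not the paper's actual algorithm: the per-cluster influence scores, the `select_clusters` helper, and the exploration weight `c` are all hypothetical, and the reward here is simply the influence score of one randomly sampled instance from the chosen cluster.

```python
import math
import random

def select_clusters(cluster_samples, budget, c=1.0, seed=0):
    """UCB1-style cluster selection (illustrative sketch).

    Each cluster is an arm; pulling an arm samples one instance from that
    cluster and observes its (hypothetical) influence score as the reward.
    cluster_samples[i] is a list of influence scores for cluster i.
    Returns the sequence of cluster indices chosen over `budget` rounds.
    """
    rng = random.Random(seed)
    k = len(cluster_samples)
    counts = [0] * k      # how often each cluster has been selected
    means = [0.0] * k     # running mean influence observed per cluster
    chosen = []
    for t in range(1, budget + 1):
        untried = [i for i in range(k) if counts[i] == 0]
        if untried:
            # Play every arm once before applying the UCB rule.
            i = untried[0]
        else:
            # High mean influence (quality) or a large exploration bonus
            # from few selections (diversity) both raise a cluster's score.
            i = max(range(k),
                    key=lambda j: means[j] + c * math.sqrt(math.log(t) / counts[j]))
        reward = rng.choice(cluster_samples[i])  # sample one instance's influence
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        chosen.append(i)
    return chosen
```

In this toy setting, a cluster whose sampled instances have consistently high influence is selected most often, while rarely chosen clusters still get revisited through the exploration bonus, mirroring the quality-diversity trade-off described above.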