Coreset selection aims to choose a subset of the most informative training samples so that models can be trained efficiently. It has gained traction in deep learning as training datasets have grown rapidly. Sample selection hinges on two aspects: how well a sample represents the data, which drives performance, and how diverse the selected samples are, which helps avert overfitting. Existing methods typically measure both representation and diversity with similarity metrics such as the L2-norm. They handle representation well through distribution matching guided by the similarities of features, gradients, or other information between data. Their selection of diverse samples, however, remains sub-optimal. The reason is that similarity metrics usually just aggregate per-dimension similarities and ignore differences in which dimensions contribute significantly to the final similarity, so they fail to capture diversity adequately. To address this, we propose a feature-based diversity constraint that compels the chosen subset to exhibit maximum diversity. The key is a novel Contributing Dimension Structure (CDS) metric. Unlike similarity metrics that measure the overall similarity of high-dimensional features, the CDS metric considers both the reduction of redundancy in feature dimensions and the differences between the dimensions that contribute significantly to the final similarity. We show that existing methods tend to favor samples with similar CDS, which reduces the variety of CDS types within the coreset and in turn hinders model performance. In response, we integrate the CDS constraint into five classical selection methods and improve their performance. Experiments on three datasets demonstrate that the proposed method is generally effective in boosting existing methods.
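The claim that an aggregated similarity metric can hide which dimensions drive the result can be illustrated with a toy example. The sketch below is not the paper's CDS definition, which the abstract does not spell out; `contributing_dims` and its `threshold` parameter are hypothetical stand-ins that simply mark the dimensions whose absolute difference exceeds a cutoff.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: aggregates all per-dimension differences into one number."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def contributing_dims(a, b, threshold):
    """Hypothetical stand-in for a contributing-dimension structure.

    Returns a binary tuple with 1 where the absolute per-dimension
    difference exceeds `threshold` (an assumed cutoff), else 0.
    """
    return tuple(int(abs(ai - bi) > threshold) for ai, bi in zip(a, b))


# Toy 4-dimensional features (made-up values, for illustration only).
anchor = (0.0, 0.0, 0.0, 0.0)
x = (3.0, 0.0, 0.0, 4.0)
y = (0.0, 4.0, 3.0, 0.0)

dist_x = l2_distance(anchor, x)
dist_y = l2_distance(anchor, y)

struct_x = contributing_dims(anchor, x, threshold=1.0)
struct_y = contributing_dims(anchor, y, threshold=1.0)

print(dist_x, dist_y)      # same aggregated distance for both samples
print(struct_x, struct_y)  # different sets of contributing dimensions
```

Here `x` and `y` are equally far from `anchor` under the L2 distance, yet they differ from it in disjoint sets of dimensions. A selection rule that looks only at the aggregated distance cannot tell them apart, which is the gap a dimension-aware diversity constraint is meant to close.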