Coreset selection aims to choose a subset of the most informative training samples for efficient learning. It has gained traction in deep learning as training datasets have grown. Sample selection hinges on two aspects: how well the selected samples represent the data, which drives performance, and how diverse they are, which helps avert overfitting. Existing methods typically measure both representation and diversity with similarity metrics such as the L2-norm. They handle representation well through distribution matching guided by similarities between features, gradients, or other information about the data. Their selection of diverse samples, however, remains sub-optimal. This is because similarity metrics usually just aggregate per-dimension similarities, without distinguishing which dimensions contribute significantly to the final similarity, so they fail to capture diversity adequately. To address this, we propose a feature-based diversity constraint that compels the chosen subset to exhibit maximum diversity. The key is a novel Contributing Dimension Structure (CDS) metric. Unlike similarity metrics that measure the overall similarity of high-dimensional features, the CDS metric considers both the reduction of redundancy in feature dimensions and the differences between the dimensions that contribute significantly to the final similarity. We show that existing methods tend to favor samples with similar CDS, which reduces the variety of CDS types within the coreset and in turn hinders model performance. In response, we improve five classical selection methods by integrating the CDS constraint. Experiments on three datasets demonstrate that the proposed method is generally effective in boosting existing methods.
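To make the limitation of aggregated similarity concrete, the following is a minimal Python sketch, not the paper's formulation. The abstract does not give the CDS formula, so the sketch assumes that a dimension "contributes significantly" when its absolute gap to a reference point exceeds a threshold. The names `contributing_dims` and `beta`, and the toy vectors, are hypothetical. The redundancy-reduction step mentioned above is omitted.

```python
import math

def l2_distance(a, b):
    """Aggregate similarity: per-dimension gaps collapsed into one number."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contributing_dims(a, b, beta):
    """Binary mask marking dimensions whose gap |a_i - b_i| exceeds beta.

    Assumption for illustration only: the threshold rule and the name
    `beta` are hypothetical, not taken from the paper.
    """
    return tuple(int(abs(x - y) > beta) for x, y in zip(a, b))

# Toy data: a reference point (e.g. a class center) and three samples.
center = [0.0, 0.0, 0.0]
sample_a = [3.0, 0.0, 0.0]
sample_b = [0.0, 3.0, 0.0]
sample_c = [-3.0, 0.0, 0.0]

for name, s in [("a", sample_a), ("b", sample_b), ("c", sample_c)]:
    # All three samples are equally far from the center under L2,
    # but the dimension responsible for that distance differs.
    print(name, l2_distance(s, center), contributing_dims(s, center, beta=1.0))
```

Under this assumed rule, all three samples are indistinguishable by L2 distance to the center, yet `sample_a` and `sample_c` share one mask while `sample_b` has another. A selector that looks only at the distance cannot prefer `sample_b` for variety, which is the kind of gap a diversity constraint on contributing dimensions is meant to close.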