One-shot coreset selection aims to select, for a given pruning rate, a representative subset of the training data that can later be used to train models while retaining high accuracy. State-of-the-art (SOTA) coreset selection methods rank examples by an importance metric and keep the highest-scoring ones; they perform well at low pruning rates. At high pruning rates, however, they suffer a catastrophic accuracy drop and perform worse than even random sampling. This paper explores the reasons for this drop both theoretically and empirically. We first propose a novel metric that measures how well a dataset covers a specific distribution, obtained by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates underperform random sampling: they provide worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers the overall coverage of the data distribution and the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) and than random selection (e.g., 7.04% higher on CIFAR10), while achieving comparable accuracy at low pruning rates. We make our code publicly available at https://github.com/haizhongzheng/Coverage-centric-coreset-selection.
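The abstract does not spell out how coverage and importance are combined, so the following is only a minimal sketch of one plausible reading: stratify examples by importance score and spread the selection budget across the strata, rather than keeping only the top-scoring examples. The function name `stratified_coreset`, the equal-width binning, and the parameters `num_bins` and `seed` are assumptions made for illustration; the linked repository is the authoritative implementation.

```python
import numpy as np

def stratified_coreset(scores, keep_fraction, num_bins=10, seed=0):
    """Illustrative coverage-aware selection (hypothetical helper).

    Splits examples into equal-width importance-score bins and spreads
    the selection budget across the bins, so that low-, mid-, and
    high-importance regions are all represented.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    budget = int(round(keep_fraction * len(scores)))

    # Assign every example to one of num_bins equal-width score bins.
    edges = np.linspace(scores.min(), scores.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, num_bins - 1)
    bins = [np.flatnonzero(bin_ids == b) for b in range(num_bins)]
    bins = [members for members in bins if len(members) > 0]

    # Visit the smallest bins first so that any budget they cannot use
    # rolls over to the larger bins that follow.
    bins.sort(key=len)
    selected = []
    remaining = budget
    for i, members in enumerate(bins):
        share = remaining // (len(bins) - i)   # even split of what is left
        take = min(share, len(members))
        selected.append(rng.choice(members, size=take, replace=False))
        remaining -= take

    return np.sort(np.concatenate(selected))

# Usage: keep 10% of 1000 examples (a 90% pruning rate) using synthetic scores.
scores = np.random.default_rng(1).normal(size=1000)
coreset_idx = stratified_coreset(scores, keep_fraction=0.1)
top_idx = np.argsort(scores)[-100:]            # importance-only baseline
```

In this sketch the importance-only baseline `top_idx` contains only the highest-scoring examples, whereas `coreset_idx` draws from every non-empty score bin, which is the intuition behind trading some importance for coverage at high pruning rates.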