Coreset selection addresses the challenge of identifying a small, representative subset of a large dataset that preserves the patterns needed for effective machine learning. Although several surveys have examined data reduction strategies, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely training-free, training-oriented, and label-free approaches, into a single taxonomy. We cover subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering insights absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands, and highlight open challenges for future research, such as robustness to outliers, outlier filtering, and adapting coreset selection to foundation models.