The success of deep learning hinges on enormous datasets and large models, which require labor-intensive annotation and heavy computation. Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data, which can then be used to train models comparable to those trained on the full data. Two prior methods have been shown to achieve impressive results: (1) margin sampling, which selects points with high uncertainty, and (2) core-set or clustering methods such as k-center, which select informative and diverse subsets. We are not aware of any work that combines these methods in a principled manner. To this end, we develop a novel and efficient factor-3 approximation algorithm that computes subsets based on the weighted sum of the k-center and uncertainty sampling objective functions. To handle large datasets, we present a parallel algorithm that runs on multiple machines with approximation guarantees. The proposed algorithm achieves similar or better performance compared to other strong baselines on vision datasets such as CIFAR-10, CIFAR-100, and ImageNet.
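To make the idea of a weighted sum of the two objectives concrete, the following is a minimal sketch of a greedy heuristic that mixes a k-center style distance term with a margin-based uncertainty term. It is an illustration only, not the algorithm developed in this work, and it carries no approximation guarantee. The function names, the weight `lam`, the use of Euclidean distance, and the definition of uncertainty as one minus the top-two probability margin are all assumptions made for this example.

```python
import math


def margin_uncertainty(probs):
    """Uncertainty of one prediction: 1 - (top-1 prob - top-2 prob).

    Assumed definition for this sketch; a small margin gives a high score.
    """
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def greedy_select(points, uncertainty, k, lam=1.0):
    """Pick k indices by farthest-first traversal plus an uncertainty bonus.

    Illustrative heuristic only (hypothetical helper, not the paper's method).
    points: list of coordinate tuples; uncertainty: one score per point.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Seed with the most uncertain point.
    first = max(range(n), key=lambda i: uncertainty[i])
    selected = [first]

    # nearest[i] = distance from point i to its closest selected point.
    nearest = [math.dist(points[i], points[first]) for i in range(n)]

    while len(selected) < k:
        chosen = set(selected)
        # Score = distance to the current subset + lam * uncertainty.
        best = max(
            (i for i in range(n) if i not in chosen),
            key=lambda i: nearest[i] + lam * uncertainty[i],
        )
        selected.append(best)
        # Update each point's distance to the enlarged subset.
        for i in range(n):
            nearest[i] = min(nearest[i], math.dist(points[i], points[best]))
    return selected


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1)]
    unc = [0.1, 0.9, 0.2, 0.3]
    print(greedy_select(pts, unc, k=2, lam=1.0))
```

Setting `lam` to zero reduces the selection step to plain farthest-first traversal, while a large `lam` makes the choice depend mostly on uncertainty; the weight plays the role of the trade-off between diversity and uncertainty.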