Contemporary machine learning requires training large neural networks on massive datasets and therefore faces high computational demands. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, which limits its practicality and feasibility. To this end, we re-examine existing dataset distillation methods and identify three properties required for large-scale real-world applications: realism, diversity, and efficiency. As a remedy, we propose RDED, a novel, computationally efficient yet effective dataset distillation paradigm that enables both diversity and realism in the distilled data. Extensive empirical results across various neural architectures and datasets demonstrate the advantages of RDED: we can distill the full ImageNet-1K into a small dataset of 10 images per class within 7 minutes on a single RTX-4090 GPU, achieving a notable 42% top-1 accuracy with ResNet-18 (whereas the prior state-of-the-art method achieves only 21% and requires 6 hours).