Dataset distillation aims to generate a smaller but representative subset from a large dataset, so that a model can be trained efficiently while still achieving decent performance when evaluated on the original test data distribution. Many prior works align the distilled data with different aspects of the original dataset, such as training weight trajectories, gradients, or feature/BatchNorm distributions. In this work, we show how to distill large-scale datasets such as full ImageNet-1K/21K at the conventional input resolution of 224$\times$224 and achieve the best accuracy among all previous approaches, including SRe$^2$L, TESLA and MTT. To this end, we introduce a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation ($\texttt{CDA}$) during data synthesis, which reaches 63.2% accuracy on ImageNet-1K under IPC (Images Per Class) 50 and 36.1% on ImageNet-21K under IPC 20. Finally, we show that, by integrating all our enhancements, the proposed model beats the current state of the art by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to its full-data training counterpart to less than 15% absolute. Moreover, this work is the first successful dataset distillation on the larger-scale ImageNet-21K at the standard 224$\times$224 resolution. Our code and the distilled ImageNet-21K dataset (IPC 20, 2K recovery budget) are available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.
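To give a concrete picture of what a curriculum over data augmentation during synthesis could look like, here is a minimal Python sketch. It is not taken from the released code: it assumes the curriculum controls the smallest allowed crop-area fraction of a random crop, starting with easy, near-full-image crops and moving towards harder, smaller crops. The function names, the cosine shape, and the bounds `easy=1.0` and `hard=0.08` are all hypothetical choices made for illustration.

```python
import math
import random


def min_crop_scale(step, total_steps, easy=1.0, hard=0.08):
    """Hypothetical cosine schedule for the smallest allowed crop-area fraction.

    Returns `easy` at step 0 and decays smoothly to `hard` at the last step.
    The bounds and the cosine shape are assumptions, not values from the paper.
    """
    if total_steps <= 1:
        return hard
    # Fraction of the synthesis budget already used, clamped to [0, 1].
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return hard + 0.5 * (easy - hard) * (1.0 + math.cos(math.pi * progress))


def sample_crop_scale(step, total_steps, rng=random):
    """Draw a crop-area fraction uniformly from [current lower bound, 1]."""
    return rng.uniform(min_crop_scale(step, total_steps), 1.0)


if __name__ == "__main__":
    total = 2000  # illustrative only; mirrors the "2K recovery budget" wording
    for step in (0, 500, 1000, 1500, 1999):
        print(step, round(min_crop_scale(step, total), 3))
```

In a real pipeline the sampled fraction would feed the lower end of the `scale` range of a random-resized-crop transform at each synthesis iteration. The schedule shape is the main design knob, and a linear or step schedule would drop in the same way.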