Sequential Subset Matching for Dataset Distillation

Dataset distillation is an emerging task that synthesizes a small dataset for training deep neural networks (DNNs), reducing data storage and model training costs. The synthetic dataset is expected to capture the essence of the knowledge contained in a real-world dataset, so that a model trained on the former performs similarly to one trained on the latter. Recent distillation methods have notably improved the quality of synthetic datasets. However, current state-of-the-art methods treat the entire synthetic dataset as a unified entity and optimize each synthetic instance equally. This static optimization approach may degrade distillation performance. Specifically, we argue that static optimization can give rise to a coupling issue within the synthetic data, particularly when a larger amount of synthetic data is being optimized. This coupling issue, in turn, prevents the distilled dataset from extracting the high-level features that the DNN learns in the later epochs. In this study, we propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch), which tackles this problem by adaptively optimizing the synthetic data to encourage sequential acquisition of knowledge during dataset distillation. Our analysis indicates that SeqMatch effectively addresses the coupling issue by generating the synthetic instances sequentially, thereby significantly improving performance. SeqMatch outperforms state-of-the-art methods on various datasets, including SVHN, CIFAR-10, CIFAR-100, and Tiny ImageNet. Our code is available at https://github.com/shqii1j/seqmatch.
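
To make the contrast between static and sequential optimization concrete, the following is a minimal toy sketch, not the authors' SeqMatch implementation. It assumes a stand-in objective (matching the mean of random Fourier features between real and synthetic data) because the abstract does not specify what is matched. All names (`features`, `matching_loss`, `optimize_subset`, `sequential_distill`) and all hyperparameters are hypothetical. The only idea it illustrates is that the synthetic set is split into subsets that are optimized one after another, with previously optimized subsets kept frozen.

```python
import numpy as np


def features(x, W, b):
    """Random Fourier features: a fixed, nonlinear embedding of each row of x."""
    return np.cos(x @ W.T + b)


def matching_loss(syn, target_mean, W, b):
    """Mean squared gap between the synthetic and real mean embeddings."""
    residual = features(syn, W, b).mean(axis=0) - target_mean
    return float(np.mean(residual ** 2))


def optimize_subset(frozen, active, target_mean, W, b, lr=0.1, steps=500):
    """Gradient descent on `active` only; `frozen` rows are never modified."""
    active = active.copy()
    total = len(frozen) + len(active)
    for _ in range(steps):
        syn = np.vstack([frozen, active])
        residual = features(syn, W, b).mean(axis=0) - target_mean
        # Analytic gradient of matching_loss with respect to each active row.
        z = active @ W.T + b
        grad = (2.0 / (W.shape[0] * total)) * ((-np.sin(z) * residual) @ W)
        active -= lr * grad
    return active


def sequential_distill(real, syn_init, num_subsets, W, b, **kwargs):
    """Optimize the synthetic set one subset at a time (toy stand-in objective)."""
    target_mean = features(real, W, b).mean(axis=0)
    done = np.empty((0, syn_init.shape[1]))  # subsets optimized so far
    history = []  # (loss before, loss after) for each stage
    for subset in np.array_split(syn_init, num_subsets):
        before = matching_loss(np.vstack([done, subset]), target_mean, W, b)
        subset = optimize_subset(done, subset, target_mean, W, b, **kwargs)
        done = np.vstack([done, subset])
        history.append((before, matching_loss(done, target_mean, W, b)))
    return done, history


# Toy data: two Gaussian blobs stand in for a "real" dataset (an assumption).
rng = np.random.default_rng(0)
real = np.concatenate([rng.normal(-2.0, 0.5, (200, 2)),
                       rng.normal(2.0, 0.5, (200, 2))])
W = rng.normal(size=(64, 2))
b = rng.uniform(0.0, 2 * np.pi, size=64)
syn_init = rng.normal(size=(12, 2))

syn, history = sequential_distill(real, syn_init, num_subsets=3, W=W, b=b)
for stage, (before, after) in enumerate(history, start=1):
    print(f"stage {stage}: loss {before:.5f} -> {after:.5f}")
```

A static baseline in this toy setting would correspond to `num_subsets=1`, where all synthetic points are updated together. Whether the sequential variant helps on real distillation objectives is the claim the paper evaluates. This sketch does not test it.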
