Dataset distillation creates compact datasets that achieve training performance comparable to that of the original large-scale datasets, which is essential for reducing data storage and training costs. Prevalent methods transfer knowledge by matching the gradients, embedding distributions, or training trajectories of synthetic images with those of sampled original images. Although matching objectives vary, the strategy for selecting original images is currently limited to naive random sampling. We argue that random sampling overlooks the evenness of the selected sample distribution, which may result in noisy or biased matching targets, and that it places no constraint on sample diversity. In addition, current methods predominantly focus on single-dimensional matching, which leaves available information underused. To address these challenges, we propose a novel matching strategy called Dataset Distillation by Bidirectional REpresentAtive Matching (DREAM+), which selects representative original images for bidirectional matching. DREAM+ is applicable to a variety of mainstream dataset distillation frameworks and reduces the number of distillation iterations by more than 15 times without affecting performance. Given sufficient training time, DREAM+ can further improve performance and achieve state-of-the-art results. We have released the code at github.com/NUS-HPC-AI-Lab/DREAM+.
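The abstract does not say how representative images are chosen, so the following is only an illustrative sketch of the contrast with random sampling. It assumes that "representative" can be approximated by clustering the feature vectors of one class with k-means and keeping the real sample closest to each centroid. The function name `select_representative`, its parameters, and the use of k-means are assumptions made for illustration, not the released DREAM+ implementation.

```python
import numpy as np


def select_representative(features, n_select, n_iter=20, seed=0):
    """Hypothetical helper: return indices of the samples nearest to
    each of n_select k-means centroids (assumed selection rule)."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Initialise centroids from distinct randomly chosen samples.
    init = rng.choice(n, size=n_select, replace=False)
    centroids = features[init].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distances, shape (n, n_select).
        dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for k in range(n_select):
            members = features[assign == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    dist = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # For each centroid, the index of the closest real sample.
    return dist.argmin(axis=0)


# Usage: compare with naive random sampling on toy features.
toy_rng = np.random.default_rng(1)
toy_features = toy_rng.normal(size=(60, 4))
representative_idx = select_representative(toy_features, n_select=5)
random_idx = toy_rng.choice(60, size=5, replace=False)
```

Under this assumption, the selected indices follow the cluster structure of the class and so spread across it, whereas a small random draw can land unevenly by chance.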