Dataset distillation aims to generate small datasets that retain most of the information in large-scale datasets, thereby reducing storage and training costs. Recent state-of-the-art methods mainly constrain the sample generation process by matching synthetic images to original ones with respect to gradients, embedding distributions, or training trajectories. Although the matching objectives vary, original images are currently selected only by naive random sampling. We argue that random sampling inevitably includes samples near the decision boundaries, which may provide large or noisy matching targets. Moreover, random sampling cannot guarantee that the selected samples are evenly distributed and diverse. Together, these factors cause large optimization oscillations and degrade matching efficiency. Accordingly, we propose a novel matching strategy named \textbf{D}ataset distillation by \textbf{RE}present\textbf{A}tive \textbf{M}atching (DREAM), in which only representative original images are selected for matching. DREAM can be easily plugged into popular dataset distillation frameworks and reduces the number of matching iterations by a factor of 10 without a performance drop. Given sufficient training time, DREAM provides further significant improvements and achieves state-of-the-art performance.
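The abstract does not state how representative images are chosen, so the following is only a minimal sketch of one plausible criterion: cluster the feature vectors of one class with a small k-means loop and keep the sample nearest to each cluster center. The function name `select_representatives`, the use of k-means, and the assumption that each image is already encoded as a feature vector are all illustrative choices, not details taken from the source.

```python
import numpy as np


def select_representatives(features, n_select, n_iter=20, seed=0):
    """Return indices of n_select rows of `features` nearest to k-means centers.

    Hypothetical helper: the clustering criterion is an assumption made for
    illustration, not the selection rule confirmed by the abstract.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    if n_select >= n:
        return np.arange(n)

    rng = np.random.default_rng(seed)
    # Initialise the centers from distinct, randomly chosen samples.
    centers = features[rng.choice(n, size=n_select, replace=False)].copy()

    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n, n_select).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_select):
            members = features[assign == k]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[k] = members.mean(axis=0)

    # For each center, keep the closest sample that has not been picked yet.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    chosen = []
    for k in range(n_select):
        for idx in np.argsort(dists[:, k]):
            if int(idx) not in chosen:
                chosen.append(int(idx))
                break
    return np.array(chosen)
```

In a matching-based distillation loop, the returned indices would replace a randomly sampled mini-batch of original images for each class. Unlike random sampling, this selection spreads the picks across the clusters and favors samples near cluster centers over outliers.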