Dataset Distillation (DD) aims to synthesize a small dataset that performs comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains limited. In this paper, we take an initial step toward understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (images-per-class) settings. Based on our empirical observations and theoretical analysis, we introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples and thereby achieve higher dataset quality. SDC can be seamlessly integrated into existing methods as a plugin with minimal code adjustments. Experimental results demonstrate that adding SDC yields higher-quality distilled datasets across 7 distillation methods and 6 datasets.
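The abstract's difficulty measure — the norm of the per-sample loss gradient — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it scores samples for a logistic-regression surrogate (the function name `sample_difficulty` and the choice of model are assumptions for illustration); in the paper the gradient is taken with respect to a neural network's parameters.

```python
import numpy as np

def sample_difficulty(X, y, w):
    """Score each sample by the L2 norm of its log-loss gradient
    w.r.t. the parameters w of a logistic-regression surrogate.
    Larger norm = harder sample (illustrative stand-in for the
    network-gradient norm used in the paper)."""
    scores = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))  # predicted probability
        grad = (p - yi) * xi               # per-sample gradient of log-loss
        scores.append(np.linalg.norm(grad))
    return np.array(scores)

# Confidently correct samples get small scores (easy); misclassified
# ones get large scores (hard).
X = np.array([[5.0, 0.0], [-5.0, 0.0]])
w = np.array([1.0, 0.0])
easy = sample_difficulty(X, np.array([1.0, 0.0]), w)  # labels agree with w
hard = sample_difficulty(X, np.array([0.0, 1.0]), w)  # labels flipped
```

Ranking samples by such a score is what lets a distillation method bias its synthesis toward easier (low-norm) regions of the data, as SDC advocates for low-IPC settings.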