Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching

The ultimate goal of Dataset Distillation is to synthesize a small synthetic dataset such that a model trained on it performs as well as a model trained on the full, real dataset. Until now, no Dataset Distillation method has reached this completely lossless goal, in part because previous methods remain effective only when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, achieving truly lossless dataset distillation requires a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory matching, i.e., optimizing the synthetic data to induce long-term training dynamics similar to those of the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set, since there are fewer examples across which to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets, since there are now enough samples to represent the necessary complex patterns. Based on these findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
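To make the idea of aligning trajectory stage with synthetic set size concrete, the following is a minimal sketch, not the released implementation. It assumes the normalized parameter-distance loss commonly used in trajectory matching, with parameters flattened to plain lists of floats. The function names, the `ipc` (images per class) thresholds, and the epoch ranges are all hypothetical and chosen only for illustration.

```python
import random


def matching_loss(student_end, teacher_start, teacher_target):
    """Normalized squared distance between the student's final parameters
    (after training on synthetic data from teacher_start) and a later
    teacher checkpoint. All arguments are flat lists of floats."""
    num = sum((s - t) ** 2 for s, t in zip(student_end, teacher_target))
    den = sum((a - b) ** 2 for a, b in zip(teacher_start, teacher_target))
    return num / den


def sample_start_epoch(ipc, max_epoch, rng=random):
    """Pick which teacher epoch to start matching from.

    Hypothetical rule: small synthetic sets match early (easy) segments,
    large synthetic sets match late (hard) segments. The thresholds below
    are illustrative assumptions, not values from the paper.
    """
    if ipc <= 10:
        lo, hi = 0, max_epoch // 4
    elif ipc <= 100:
        lo, hi = max_epoch // 4, max_epoch // 2
    else:
        lo, hi = max_epoch // 2, max_epoch
    return rng.randint(lo, hi)  # bounds are inclusive
```

In a full distillation loop, the sampled epoch would select the teacher checkpoint used as `teacher_start`, and the synthetic images would be updated by differentiating `matching_loss` through the student's training steps; that part is omitted here.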
