The ultimate goal of dataset distillation is to synthesize a small synthetic dataset such that a model trained on it performs as well as a model trained on the full, real dataset. Until now, no dataset distillation method has reached this completely lossless goal, in part because previous methods remain effective only when the total number of synthetic samples is extremely small. Since only so much information can be contained in such a small number of samples, achieving truly lossless dataset distillation requires a distillation method that remains effective as the size of the synthetic dataset grows. In this work, we present such an algorithm and elucidate why existing methods fail to generate larger, high-quality synthetic sets. Current state-of-the-art methods rely on trajectory matching, that is, optimizing the synthetic data to induce long-term training dynamics similar to those induced by the real data. We empirically find that the training stage of the trajectories we choose to match (i.e., early or late) greatly affects the effectiveness of the distilled dataset. Specifically, early trajectories (where the teacher network learns easy patterns) work well for a low-cardinality synthetic set, since there are fewer examples across which to distribute the necessary information. Conversely, late trajectories (where the teacher network learns hard patterns) provide better signals for larger synthetic sets, since there are now enough samples to represent the necessary complex patterns. Based on these findings, we propose to align the difficulty of the generated patterns with the size of the synthetic dataset. In doing so, we successfully scale trajectory-matching-based methods to larger synthetic datasets, achieving lossless dataset distillation for the very first time. Code and distilled datasets are available at https://gzyaftermath.github.io/DATM.
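The idea of aligning trajectory stage with synthetic-set size can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the images-per-class thresholds, and the epoch windows are hypothetical and are not taken from the paper, and the loss shown is the normalized squared parameter distance commonly used in trajectory matching, which the abstract itself does not specify.

```python
import random


def start_epoch_range(images_per_class, num_teacher_epochs):
    """Hypothetical rule mapping synthetic-set size to a window of teacher epochs.

    Small synthetic sets draw start epochs from early training (easy patterns);
    large ones draw from late training (hard patterns). The thresholds below are
    illustrative placeholders, not values from the paper.
    """
    if images_per_class <= 1:
        lo, hi = 0, num_teacher_epochs // 10
    elif images_per_class <= 50:
        lo, hi = 0, num_teacher_epochs // 2
    else:
        lo, hi = num_teacher_epochs // 2, num_teacher_epochs
    # Guarantee a non-empty half-open window [lo, hi).
    return lo, max(lo + 1, hi)


def sample_start_epoch(images_per_class, num_teacher_epochs, rng):
    """Sample the teacher epoch at which a matching segment begins."""
    lo, hi = start_epoch_range(images_per_class, num_teacher_epochs)
    return rng.randrange(lo, hi)


def matching_loss(student_end, teacher_start, teacher_target):
    """Normalized squared distance between student and teacher parameters.

    Parameters are flat lists of floats. The denominator is the distance the
    teacher itself travelled over the matched segment; it must be non-zero.
    """
    num = sum((s - t) ** 2 for s, t in zip(student_end, teacher_target))
    den = sum((a - b) ** 2 for a, b in zip(teacher_start, teacher_target))
    return num / den


if __name__ == "__main__":
    rng = random.Random(0)
    for ipc in (1, 10, 1000):
        print(ipc, start_epoch_range(ipc, 100), sample_start_epoch(ipc, 100, rng))
```

In a full pipeline, the synthetic data would be updated by gradient descent on this loss, with the student initialized from the teacher parameters at the sampled start epoch; that optimization loop is omitted here.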