Training a state-of-the-art machine learning model typically requires large-scale datasets, which makes training and parameter tuning expensive and time-consuming. Dataset Distillation (DD) is a data-efficient approach that distils the information in a real-world dataset into a small, compact synthetic dataset that can still train a well-performing model. Despite recent progress in this field, existing methods still underperform and cannot effectively replace large datasets. In this paper, unlike previous methods that focus solely on improving the efficacy of student distillation, we are the first to recognize the important interplay between expert and student. We argue that expert smoothness has a significant impact when more potent expert trajectories are used in subsequent dataset distillation. Based on this, we integrate a clipping loss and a gradient penalty to regulate the rate of parameter change in expert trajectories. Furthermore, because distillation is sensitive to randomly initialized variables, we propose representative initialization for the synthetic dataset and a balanced inner-loop loss. Finally, we present two enhancement strategies, intermediate matching loss and weight perturbation, to mitigate the accumulation of errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.