Training large AI models typically requires large-scale datasets, which makes both training and parameter tuning time-consuming and costly. Some researchers address this problem by carefully synthesizing a very small number of highly representative and informative samples from real-world datasets. This approach, known as Dataset Distillation (DD), offers a new perspective on data-efficient learning. Despite recent progress in this field, existing methods still fall short of expectations, and distilled datasets cannot yet effectively replace the original datasets. In this paper, unlike previous methods that focus solely on improving the effectiveness of student distillation, we recognize and leverage the important mutual influence between expert and student models. We observe that the smoothness of expert trajectories has a significant impact on subsequent student parameter alignment. Based on this observation, we propose an effective DD framework named AST, which stands for Alignment with Smooth and high-quality expert Trajectories. We combine a clipping loss with a gradient penalty to regulate the rate of parameter change during expert trajectory generation. To further refine the alignment of student parameters with the expert trajectory, and because distillation is sensitive to randomly initialized variables, we propose representative initialization for the synthetic dataset and a balanced inner-loop loss. We also propose two enhancement strategies, an intermediate matching loss and weight perturbation, to mitigate the accumulation of errors. We conduct extensive experiments on datasets of different scales, sizes, and resolutions. The results demonstrate that the proposed method significantly outperforms prior methods.
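To make "student parameter alignment" concrete, the following is a minimal sketch of a normalized parameter-matching loss of the kind commonly used in trajectory-matching distillation. The abstract does not state AST's exact objective, so the formula, the function name `alignment_loss`, and the flat-list representation of parameters are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Sequence


def alignment_loss(student: Sequence[float],
                   expert_start: Sequence[float],
                   expert_target: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Hypothetical normalized alignment loss (an assumption, not AST's stated objective).

    Returns ||student - expert_target||^2 / ||expert_start - expert_target||^2.
    """
    if not (len(student) == len(expert_start) == len(expert_target)):
        raise ValueError("all parameter vectors must have the same length")
    # Squared distance from the student's parameters to the expert's target checkpoint.
    num = sum((s - t) ** 2 for s, t in zip(student, expert_target))
    # Squared distance the expert itself travelled over the same trajectory segment.
    den = sum((a - t) ** 2 for a, t in zip(expert_start, expert_target))
    # eps guards against division by zero when the expert did not move.
    return num / (den + eps)


if __name__ == "__main__":
    start = [0.0, 0.0]
    target = [1.0, 1.0]
    halfway = [0.5, 0.5]
    print(alignment_loss(halfway, start, target))
```

Under this assumed form, a value near 0 means the student trained on synthetic data has reached the expert's target checkpoint, while a value near 1 means it has made no net progress from the shared starting point. Dividing by the expert's own displacement keeps the loss comparable across trajectory segments in which the expert parameters change by very different amounts, which is one reason smoother expert trajectories could plausibly make the alignment better conditioned.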