Herein, we propose a novel dataset distillation method for constructing small, informative datasets that preserve the information of large original datasets. The development of deep learning models is enabled by the availability of large-scale datasets. Despite their unprecedented success, large-scale datasets considerably increase storage and transmission costs, which makes model training cumbersome. Moreover, training on raw data raises privacy and copyright concerns. To address these issues, a new task named dataset distillation has been introduced, which aims to synthesize a compact dataset that retains the essential information of the large original dataset. State-of-the-art (SOTA) dataset distillation methods match the gradients or network parameters obtained during training on real and synthetic datasets. However, the contribution of different network parameters to the distillation process varies, and treating them uniformly degrades distillation performance. Based on this observation, we propose an importance-aware adaptive dataset distillation (IADD) method that improves distillation performance by automatically assigning importance weights to different network parameters during distillation, thereby synthesizing more robust distilled datasets. On multiple benchmark datasets, IADD outperforms other SOTA dataset distillation methods based on parameter matching, and it also outperforms them in cross-architecture generalization. In addition, an analysis of the self-adaptive weights demonstrates the effectiveness of IADD. Furthermore, the effectiveness of IADD is validated in a real-world medical application, namely COVID-19 detection.
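To make the idea of importance-weighted parameter matching concrete, the following is a minimal sketch and not the formulation used by IADD, which the abstract does not specify. It assumes a normalized squared-distance matching loss between the parameters reached by training on synthetic data and the parameters reached by training on real data, with one non-negative weight per parameter. The function name `weighted_matching_loss` and all argument names are hypothetical, and the weights are passed in by hand here rather than learned during distillation.

```python
import numpy as np

def weighted_matching_loss(student, target, start, weights):
    """Importance-weighted, normalized squared distance between parameter vectors.

    Assumed form (not taken from the source):
        sum(w * (student - target)^2) / sum(w * (start - target)^2)

    student: parameters after training on the synthetic dataset
    target:  parameters after training on the real dataset
    start:   shared initial parameters of both training runs
    weights: non-negative importance weight for each parameter
    """
    student, target, start, weights = (
        np.asarray(a, dtype=float) for a in (student, target, start, weights)
    )
    numerator = np.sum(weights * (student - target) ** 2)
    denominator = np.sum(weights * (start - target) ** 2)
    return numerator / denominator

if __name__ == "__main__":
    start = [2.0, 2.0]
    target = [0.0, 0.0]
    student = [1.0, 0.0]  # second parameter already matches the target

    # Uniform weights: every parameter contributes equally.
    print(weighted_matching_loss(student, target, start, [1.0, 1.0]))  # 0.125
    # All weight on the unmatched first parameter: the loss grows.
    print(weighted_matching_loss(student, target, start, [1.0, 0.0]))  # 0.25
```

With uniform weights the sketch reduces to an unweighted normalized parameter distance, so the weights only change which parameters dominate the matching objective. In an adaptive scheme such as the one the abstract describes, these weights would be adjusted automatically during distillation instead of being fixed.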