Dataset distillation, or condensation, aims to condense a large-scale training dataset into a much smaller synthetic one such that neural networks trained on the distilled set perform similarly to those trained on the original set. Although the number of training samples can be reduced substantially, current state-of-the-art methods rely heavily on an enormous number of soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to that of the original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, which aims to build effective image-to-label projectors with which synthetic labels can be generated online directly from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between the pre-trained and target distributions, so that the original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that, with only about 0.003% of the original storage required for a complete set of soft labels, we achieve performance comparable to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be made available.
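To make the idea of an image-to-label projector built from low-rank matrices concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes image features have already been extracted by a frozen pre-trained encoder, and it uses a random matrix as a stand-in for the frozen pre-trained projection. The class `LowRankProjector`, its methods, and all sizes (`feat_dim`, `num_classes`, `rank`) are hypothetical names and toy values chosen for illustration. Only the two small factors would need to be stored, and soft labels are regenerated on demand rather than saved per image.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class LowRankProjector:
    """Hypothetical image-to-label projector: frozen weight plus a low-rank update."""

    def __init__(self, frozen_weight, rank, rng):
        feat_dim, num_classes = frozen_weight.shape
        self.W = frozen_weight  # stand-in for frozen pre-trained weights (assumption)
        # Low-rank factors: the only parameters that would be trained and stored.
        self.A = rng.normal(scale=0.01, size=(feat_dim, rank))
        # One factor starts at zero so the update is initially a no-op,
        # as in standard LoRA initialization.
        self.B = np.zeros((rank, num_classes))

    def soft_labels(self, features, temperature=1.0):
        """Generate soft labels online from image features."""
        logits = features @ self.W + (features @ self.A) @ self.B
        return softmax(logits, temperature)

    def adapter_size(self):
        """Number of values in the low-rank factors."""
        return self.A.size + self.B.size


rng = np.random.default_rng(0)
feat_dim, num_classes, rank = 512, 1000, 8  # toy sizes, not from the paper

W = rng.normal(size=(feat_dim, num_classes))
projector = LowRankProjector(W, rank, rng)

features = rng.normal(size=(4, feat_dim))  # stand-in for encoded synthetic images
labels = projector.soft_labels(features)

print(labels.shape)              # one soft-label vector per image
print(projector.adapter_size())  # values stored for the adapter
```

In this sketch the adapter holds `feat_dim * rank + rank * num_classes` values regardless of how many images or augmentations are labeled, whereas pre-computed soft labels grow with the number of stored label vectors. How the factors are trained and how the images are further optimized are not specified in the abstract and are left out here.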