The efficacy of machine learning has traditionally relied on the availability of increasingly large datasets. However, large datasets pose storage challenges and contain non-influential samples, which can be ignored during training without affecting the final accuracy of the model. In response to these limitations, the concept of distilling the information in a dataset into a condensed set of (synthetic) samples, namely a distilled dataset, emerged. One crucial aspect is the architecture (usually a ConvNet) selected for linking the original and synthetic datasets. However, the final accuracy is lower if the model architecture used for training differs from the one used during distillation. Another challenge is the generation of high-resolution images, e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation with Diffusion Models (LD3M), which combines diffusion in latent space with dataset distillation to tackle both challenges. LD3M incorporates a novel diffusion process tailored for dataset distillation, which improves the gradient norms for learning synthetic images. By adjusting the number of diffusion steps, LD3M also offers a straightforward way of controlling the trade-off between speed and accuracy. We evaluate our approach on several ImageNet subsets and for high-resolution images (128x128 and 256x256). LD3M consistently outperforms state-of-the-art distillation techniques by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.
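The general idea of distilling a dataset into latent codes, rather than into pixels directly, can be illustrated with a deliberately tiny sketch. It is not LD3M: a frozen linear map (`TOY_DECODER`) stands in for the pretrained generative model, a simple class-mean matching loss stands in for the distillation objective, and no diffusion process is involved. All names, dimensions, and hyperparameters below are assumptions made for illustration only.

```python
import random

# Hypothetical frozen "decoder": maps a 2-D latent code to a 4-D sample.
# In a real latent-distillation setup this would be a pretrained generative model.
TOY_DECODER = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.5, -0.5]]


def make_toy_data(seed=0, per_class=50, noise=0.1):
    """Build a two-class toy dataset of 4-D points scattered around fixed centers."""
    rng = random.Random(seed)
    centers = [[2.0, 0.0, 1.0, 1.0], [0.0, 2.0, 1.0, -1.0]]
    X, y = [], []
    for label, center in enumerate(centers):
        for _ in range(per_class):
            X.append([c + rng.gauss(0.0, noise) for c in center])
            y.append(label)
    return X, y


def matvec(W, z):
    """Multiply a d x k matrix (list of rows) by a length-k vector."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]


def class_means(X, y, num_classes):
    """Per-class mean of the real samples: the statistic the synthetic set must match."""
    d = len(X[0])
    sums = [[0.0] * d for _ in range(num_classes)]
    counts = [0] * num_classes
    for x, label in zip(X, y):
        counts[label] += 1
        for i, v in enumerate(x):
            sums[label][i] += v
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def matching_loss(W, Z, means):
    """Sum over classes of the squared distance between decoded latent and class mean."""
    total = 0.0
    for z, mu in zip(Z, means):
        total += sum((a - b) ** 2 for a, b in zip(matvec(W, z), mu))
    return total


def distill_latents(X, y, W, num_classes, steps=200, lr=0.05):
    """Learn one latent code per class by gradient descent; the decoder W stays frozen."""
    k = len(W[0])
    means = class_means(X, y, num_classes)
    Z = [[0.0] * k for _ in range(num_classes)]
    history = [matching_loss(W, Z, means)]
    for _ in range(steps):
        for c in range(num_classes):
            residual = [a - b for a, b in zip(matvec(W, Z[c]), means[c])]
            # Gradient of ||W z - mu||^2 with respect to z is 2 * W^T residual.
            grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                    for j in range(k)]
            Z[c] = [z_j - lr * g_j for z_j, g_j in zip(Z[c], grad)]
        history.append(matching_loss(W, Z, means))
    return Z, history


if __name__ == "__main__":
    X, y = make_toy_data()
    Z, history = distill_latents(X, y, TOY_DECODER, num_classes=2)
    print("initial loss:", history[0], "final loss:", history[-1])
    print("distilled latents (one per class):", Z)
```

Here 100 real samples are summarized by two latent codes, one per class, which mirrors the "1 image per class" setting only in spirit. Only the latent codes receive gradient updates, which is the design choice that separates latent-space distillation from pixel-space distillation.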