Dataset Distillation as Pushforward Optimal Quantization

Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves performance similar to training on the real data, at orders of magnitude lower computational cost. Existing methods can be broadly categorized as either bi-level optimization methods, which have neural network training heuristics as the lower-level problem, or disentangled methods, which bypass the bi-level optimization by matching distributions of data. The latter class has the major advantages of speed and scalability with respect to the sizes of both the training and distilled datasets. We demonstrate that, when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, in which a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous state-of-the-art (SOTA) method D\textsuperscript{4}M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher images-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.
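
To make the idea of quantization by latent-space clustering concrete, the following is a minimal sketch and not the implementation of the proposed method. It runs Lloyd's algorithm (k-means) on latent vectors and returns the quantization points, the empirical mass of each cluster, and the mean squared projection distance. The function name `quantize_latents`, the toy Gaussian latents, and all parameter values are assumptions made for illustration; in an actual pipeline the latents would come from an encoder and the resulting points would be passed through a decoder, neither of which is modeled here.

```python
import numpy as np


def quantize_latents(latents, k, n_iters=50, seed=0):
    """Illustrative k-means quantizer (hypothetical helper, not the paper's code).

    latents: array of shape (n, d), assumed to be encoder outputs.
    Returns (centers, weights, distortion).
    """
    rng = np.random.default_rng(seed)
    n = latents.shape[0]
    # Initialize the k quantization points at distinct data points.
    centers = latents[rng.choice(n, size=k, replace=False)].copy()

    for _ in range(n_iters):
        # Squared Euclidean distance from every latent to every center: (n, k).
        d2 = ((latents[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Move each center to the mean of its cluster; empty clusters stay put.
        for j in range(k):
            members = latents[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    # Final assignment, cluster masses, and mean squared projection distance.
    d2 = ((latents[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    weights = np.bincount(assign, minlength=k) / n
    distortion = d2.min(axis=1).mean()
    return centers, weights, distortion


# Toy stand-in for encoder outputs: three Gaussian blobs in a 4-dimensional space.
rng = np.random.default_rng(0)
means = np.array([[-5.0] * 4, [0.0] * 4, [5.0] * 4])
latents = np.concatenate([m + rng.normal(size=(100, 4)) for m in means])

centers, weights, distortion = quantize_latents(latents, k=3)
print(centers.shape, weights, distortion)
```

Under these assumptions, the pairs of `centers` and `weights` define a discrete measure approximating the empirical latent distribution. Pushing the centers through a decoder would then yield the synthetic samples, with the weights available for reweighting the training loss if desired.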
