Large-scale datasets burden model training with heavy storage and computation costs while containing substantial conceptual redundancy. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which is often crude and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold-consistent guidance at every denoising timestep. Our method computes IPCs via hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. From a local neighborhood of the extracted IPC centroids, we estimate the latent manifold at each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, the ℓ2 distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free dataset distillation framework.
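The core geometric operation described above, projecting a guidance vector onto the local tangent space of a manifold estimated from nearby IPC centroids, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `tangent_project`, the use of local PCA (via SVD) over the k nearest centroids to span the tangent space, and the parameters `k` and `d` are all assumptions for clarity.

```python
import numpy as np

def tangent_project(v, z, ipc_centroids, k=8, d=4):
    """Hypothetical sketch: project guidance vector v at latent z onto the
    local tangent space estimated from the k nearest IPC centroids.

    v             : (dim,) mode-alignment guidance vector
    z             : (dim,) current denoising latent
    ipc_centroids : (n, dim) multi-scale coreset of IPC centroids
    k             : neighborhood size for the local manifold estimate
    d             : assumed intrinsic dimension of the tangent space
    """
    # Find the k IPC centroids closest to the current latent.
    dists = np.linalg.norm(ipc_centroids - z, axis=1)
    nbrs = ipc_centroids[np.argsort(dists)[:k]]

    # Local PCA: the top-d right singular vectors of the centered
    # neighborhood span an estimate of the local tangent space.
    mu = nbrs.mean(axis=0)
    _, _, Vt = np.linalg.svd(nbrs - mu, full_matrices=False)
    basis = Vt[:d]  # (d, dim) orthonormal rows

    # Orthogonal projection of v onto the span of the tangent basis;
    # the projected vector replaces v as the manifold-faithful guidance.
    return basis.T @ (basis @ v)
```

Components of the guidance vector orthogonal to the estimated tangent space are discarded, so the update cannot push the trajectory off the local manifold; components within the tangent space pass through unchanged.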