Dataset distillation aims to synthesize a compact yet representative dataset that preserves the essential characteristics of the original data for efficient model training. Existing methods mainly focus on improving the alignment between synthetic and original data or on scaling distillation to large datasets. In this work, we propose $\textbf{C}$ommittee $\textbf{V}$oting for $\textbf{D}$ataset $\textbf{D}$istillation ($\textbf{CV-DD}$), an orthogonal approach that leverages the collective knowledge of multiple models to produce higher-quality distilled data. We first establish a strong baseline that achieves state-of-the-art performance through modern architectural and optimization choices. By integrating distributions and predictions from multiple models and generating high-quality soft labels, our method captures a broader range of data characteristics, reduces model-specific bias and the impact of distribution shifts, and significantly improves generalization. This voting-based strategy enhances diversity and robustness, alleviates overfitting, and improves performance in post-distillation evaluation. Extensive experiments across multiple datasets and IPC settings demonstrate that CV-DD consistently outperforms single- and multi-model distillation methods and generalizes well to non-training-based frameworks and challenging synthetic-to-real transfer tasks. Code is available at: https://github.com/Jiacheng8/CV-DD.
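The core idea of combining predictions from a committee of models into a single soft label can be sketched as a weighted average of each member's softened class distribution. The snippet below is a minimal illustration, not the paper's actual implementation; the function names, uniform default weights, and temperature parameter are assumptions for exposition.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def committee_soft_labels(logits_per_model, weights=None, temperature=1.0):
    """Combine per-model logits into one soft label per sample.

    logits_per_model: shape (num_models, num_samples, num_classes).
    weights: optional per-model voting weights; uniform if None
             (a simplifying assumption, not the paper's scheme).
    """
    logits_per_model = np.asarray(logits_per_model, dtype=np.float64)
    num_models = logits_per_model.shape[0]
    if weights is None:
        weights = np.full(num_models, 1.0 / num_models)
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    probs = softmax(logits_per_model, temperature)    # (M, N, C)
    # Weighted average over the model axis -> (N, C) soft labels.
    return np.tensordot(weights, probs, axes=(0, 0))

# Toy example: two hypothetical committee members, one sample, three classes.
logits = [[[2.0, 0.5, 0.1]],
          [[0.2, 1.5, 0.3]]]
soft = committee_soft_labels(logits)
```

Averaging the committee's distributions rather than taking a hard majority vote preserves each member's uncertainty, which is what makes the resulting labels "soft" and helps reduce the bias any single model would impose.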