Dataset condensation faces an inherent trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficient condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. Through empirical observation, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic unit of semantic representation. We argue that improving the colorfulness of condensed images benefits representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes a latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, beyond focusing on downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion models with condensed datasets. The Fréchet Inception Distance (FID) and Inception Score (IS) results show that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.