Dataset Condensation with Color Compensation

Dataset condensation faces an inherent trade-off: balancing performance and fidelity under extreme compression. Existing methods hit two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) condense inefficiently, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. From empirical observations, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images benefits representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 uses a latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3, which outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion models with condensed datasets. The Fréchet Inception Distance (FID) and Inception Score (IS) results indicate that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.
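The abstract does not say how colorfulness is measured. As an illustration only, the sketch below uses the widely cited Hasler-Süsstrunk colorfulness metric, which is an assumption and not necessarily the measure used in DC3. The function name `colorfulness` and the sample images are hypothetical.

```python
import numpy as np


def colorfulness(image: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness of an H x W x 3 RGB image.

    Assumption: this metric is an illustrative stand-in; the abstract
    does not specify which colorfulness measure DC3 uses.
    """
    rgb = image.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Opponent color channels: red-green and yellow-blue.
    rg = r - g
    yb = 0.5 * (r + g) - b

    # Combine the spread and the mean offset of both opponent channels.
    std_root = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    mean_root = np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    return float(std_root + 0.3 * mean_root)


# Hypothetical sample images: a flat gray image and random RGB noise.
rng = np.random.default_rng(0)
gray_image = np.full((32, 32, 3), 128, dtype=np.uint8)
noisy_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

gray_score = colorfulness(gray_image)    # no chroma, so the score is 0.0
noisy_score = colorfulness(noisy_image)  # strictly larger than the gray score
```

A score like this could be used to compare a condensed set before and after a color-enhancement step. Whether it tracks downstream accuracy is the paper's empirical claim, not something this sketch shows.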

