Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images that exhibit label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and its dissimilarity to existing qualified synthetic samples, which guarantees both label accuracy and intra-class diversity. Experimental results demonstrate that our method synthesizes high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
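The candidate-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each candidate is scored as the detector's confidence on the target label plus a weighted dissimilarity (here, one minus the cosine similarity to the nearest already-accepted sample). All names (`select_candidate`, `lam`) and the exact scoring form are hypothetical.

```python
import numpy as np

def select_candidate(cand_feats, cand_confs, qualified_feats, lam=0.5):
    """Pick the candidate maximizing confidence + lam * dissimilarity.

    cand_feats:      (K, D) feature vectors of K generated candidates
    cand_confs:      (K,)   detector confidence for the target label
    qualified_feats: (M, D) features of accepted samples in this class
    lam:             weight trading off confidence vs. diversity (assumed)
    """
    # L2-normalize so the dot product below is a cosine similarity
    cand = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    if len(qualified_feats) == 0:
        # No accepted samples yet: fall back to pure confidence
        return int(np.argmax(cand_confs))
    q = qualified_feats / np.linalg.norm(qualified_feats, axis=1, keepdims=True)
    # Similarity of each candidate to its nearest accepted sample
    sim = (cand @ q.T).max(axis=1)
    # Dissimilarity term rewards candidates far from accepted samples
    scores = cand_confs + lam * (1.0 - sim)
    return int(np.argmax(scores))
```

With this scoring, a slightly lower-confidence candidate can win if it is markedly more distinct from the accepted set, which is the intra-class-diversity behavior the abstract describes.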