Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images that exhibit label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and its dissimilarity to existing qualified synthetic samples, which guarantees both label accuracy and intra-class diversity. Experimental results demonstrate that our method synthesizes high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
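The candidate-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each candidate is scored as the detector's confidence on the target label plus a weighted dissimilarity (here, one minus the cosine similarity to the nearest already-accepted sample). All names (`select_candidate`, `lam`) and the exact scoring form are hypothetical.

```python
import numpy as np

def select_candidate(cand_feats, cand_confs, qualified_feats, lam=0.5):
    """Pick the candidate maximizing confidence + lam * dissimilarity.

    cand_feats:      (K, D) feature vectors of K generated candidates
    cand_confs:      (K,)   detector confidence for the target label
    qualified_feats: (M, D) features of accepted samples in this class
    lam:             weight trading off confidence vs. diversity (assumed)
    """
    # L2-normalize so the dot product below is a cosine similarity
    cand = cand_feats / np.linalg.norm(cand_feats, axis=1, keepdims=True)
    if len(qualified_feats) == 0:
        # No accepted samples yet: fall back to pure confidence
        return int(np.argmax(cand_confs))
    q = qualified_feats / np.linalg.norm(qualified_feats, axis=1, keepdims=True)
    # Similarity of each candidate to its nearest accepted sample
    sim = (cand @ q.T).max(axis=1)
    # Dissimilarity term rewards candidates far from accepted samples
    scores = cand_confs + lam * (1.0 - sim)
    return int(np.argmax(scores))
```

With this scoring, a slightly lower-confidence candidate can win if it is markedly more distinct from the accepted set, which is the intra-class-diversity behavior the abstract describes.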