Dataset distillation condenses large datasets into small synthetic subsets that achieve performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain informative instances alongside redundant or even harmful ones, and distilling the full dataset without accounting for data quality can degrade model performance. In this work, we present Influence-Weighted Distillation (IWD), a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns an adaptive weight to each instance based on its estimated impact on the distillation objective, prioritizing beneficial instances while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, yielding accuracy gains of up to 7.8%.
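To make the weighting idea concrete, the following minimal PyTorch sketch illustrates how per-instance influence estimates could reweight a distillation loss. The function name, the shapes, and the softmax mapping from influence scores to weights are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch


def influence_weighted_loss(per_instance_loss: torch.Tensor,
                            influence_scores: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Aggregate per-instance distillation losses with influence-based weights.

    per_instance_loss: shape (N,), loss contribution of each real instance
        under the current distillation objective.
    influence_scores: shape (N,), estimated impact of each instance on the
        distillation objective (higher = more beneficial).
    """
    # Map influence scores to non-negative, normalized weights; a softmax
    # emphasizes high-influence instances and downweights redundant or
    # harmful ones (an assumed mapping, used here only for illustration).
    weights = torch.softmax(influence_scores / temperature, dim=0)
    return (weights * per_instance_loss).sum()


# Usage sketch with random placeholder values.
losses = torch.rand(128)            # per-instance distillation losses
influence = torch.randn(128)        # estimated influence scores
total = influence_weighted_loss(losses, influence)
```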