Towards Principled Dataset Distillation: A Spectral Distribution Perspective

Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods degrade substantially on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for the distribution discrepancy measure, and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, yielding the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform an amplitude-phase decomposition, which adaptively prioritizes realism in tail classes. On CIFAR-10-LT with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, and its performance drops by only 5.7% when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
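The abstract does not spell out the SDD formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a Gaussian kernel, whose spectrum (by Bochner's theorem) is a Gaussian over frequencies, and compares the empirical characteristic functions of real and synthetic features at sampled frequencies. The names `empirical_cf`, `spectral_distance`, and the weight `alpha` are hypothetical; in particular, how CSDM sets the amplitude-phase trade-off per class is not stated in the abstract and is left as a plain parameter here.

```python
import numpy as np


def empirical_cf(x, freqs):
    """Empirical characteristic function of samples x (n, d) at freqs (m, d)."""
    proj = x @ freqs.T                      # (n, m) inner products <w, x>
    return np.exp(1j * proj).mean(axis=0)   # (m,) complex values


def spectral_distance(real, synth, freqs, alpha=0.5):
    """Amplitude/phase-weighted distance between two sample sets.

    alpha is a hypothetical knob: it weights the amplitude mismatch,
    and (1 - alpha) weights the phase mismatch.
    """
    cf_r = empirical_cf(real, freqs)
    cf_s = empirical_cf(synth, freqs)
    amp_r, amp_s = np.abs(cf_r), np.abs(cf_s)
    phase_gap = np.angle(cf_r) - np.angle(cf_s)

    # Exact identity:
    # |cf_r - cf_s|^2 = (amp_r - amp_s)^2 + 2*amp_r*amp_s*(1 - cos(phase_gap))
    amp_term = (amp_r - amp_s) ** 2
    phase_term = 2.0 * amp_r * amp_s * (1.0 - np.cos(phase_gap))
    return float(np.mean(alpha * amp_term + (1.0 - alpha) * phase_term))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    freqs = rng.normal(size=(64, 3))          # Gaussian-kernel spectrum (assumed)
    real = rng.normal(size=(200, 3))
    synth = rng.normal(loc=2.0, size=(10, 3))
    print(spectral_distance(real, synth, freqs))
```

With `alpha=0.5` the two terms recombine into half the mean squared difference of the characteristic functions, which is a random-feature estimate of a Gaussian-kernel discrepancy. Shifting `alpha` away from 0.5 is one simple way to emphasize either amplitude or phase agreement, under the assumptions above.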

