Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic counterparts for efficient model training. However, existing DD methods suffer substantial performance degradation on long-tailed datasets. We identify two fundamental challenges: heuristic design choices for the distribution discrepancy measure and uniform treatment of imbalanced classes. To address these limitations, we propose Class-Aware Spectral Distribution Matching (CSDM), which reformulates distribution alignment via the spectrum of a well-behaved kernel function. This technique maps the original samples into frequency space, yielding the Spectral Distribution Distance (SDD). To mitigate class imbalance, we exploit the unified form of SDD to perform an amplitude-phase decomposition, which adaptively prioritizes realism in tail classes. On CIFAR-10-LT with 10 images per class, CSDM achieves a 14.0% improvement over state-of-the-art DD methods, and its performance drops by only 5.7% when the number of images in tail classes decreases from 500 to 25, demonstrating strong stability on long-tailed data.
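The abstract only names the ingredients of the method. As a rough illustration of the general idea, the sketch below shows one way a frequency-space loss with an amplitude-phase decomposition could look in PyTorch; the function `spectral_distance`, its `tail_weight` knob, and the use of a plain FFT over mean features are assumptions made for illustration, not the paper's actual SDD or kernel-spectrum formulation.

```python
import torch

def spectral_distance(real_feats, syn_feats, tail_weight=1.0):
    # Hypothetical sketch: average each class's features into one
    # descriptor, then compare real vs. synthetic in frequency space.
    real_spec = torch.fft.rfft(real_feats.mean(dim=0))
    syn_spec = torch.fft.rfft(syn_feats.mean(dim=0))

    # Amplitude-phase decomposition of the complex spectra.
    amp_loss = (real_spec.abs() - syn_spec.abs()).pow(2).mean()
    # 1 - cos(.) respects the 2*pi periodicity of phase differences.
    phase_loss = (1 - torch.cos(torch.angle(real_spec)
                                - torch.angle(syn_spec))).mean()

    # tail_weight > 1 would up-weight amplitude realism for tail classes,
    # one plausible reading of "class-aware" weighting.
    return tail_weight * amp_loss + phase_loss

# Toy usage: a tail class with few real samples, weighted more heavily.
real = torch.randn(25, 128)                     # 25 real samples, 128-d features
syn = torch.randn(10, 128, requires_grad=True)  # 10 synthetic images per class
loss = spectral_distance(real, syn, tail_weight=2.0)
loss.backward()                                 # gradients flow to syn
```

Separating amplitude from phase gives two independently weightable terms, which is what lets a class-aware scheme emphasize different aspects of realism for head versus tail classes.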