Training deep neural networks often requires extensive data, which incurs substantial cost. Dataset condensation addresses this by learning a small synthetic set that preserves the essential information of the original large-scale dataset. Optimization-oriented methods currently dominate dataset condensation and achieve state-of-the-art (SOTA) results, but their computationally intensive bi-level optimization limits their practicality on large datasets. As a more efficient alternative, Distribution-Matching (DM)-based methods reduce cost by aligning the representation distributions of real and synthetic examples. However, current DM-based methods still underperform SOTA optimization-oriented methods. In this paper, we argue that existing DM-based methods overlook higher-order alignment of the distributions, which may lead to sub-optimal matching. Motivated by this, we propose a new DM-based method named Efficient Dataset Condensation by Higher-Order Distribution Alignment (ECHO). Specifically, rather than aligning only the first-order moment of the representation distributions as previous methods do, we learn synthetic examples by additionally aligning the higher-order moments of the representation distributions of real and synthetic examples, building on the classical theory of reproducing kernel Hilbert spaces. Experiments demonstrate that the proposed method achieves a significant performance boost while maintaining efficiency across various scenarios.
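To make the contrast between first-order and higher-order alignment concrete, the following is a minimal NumPy sketch. It is not the ECHO objective, which the abstract does not specify. It assumes a Gaussian kernel and the standard biased estimator of the squared maximum mean discrepancy (MMD) in the corresponding reproducing kernel Hilbert space. The function names (`mean_matching_loss`, `rbf_kernel`, `mmd2`), the bandwidth value, and the toy feature arrays are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_matching_loss(real, syn):
    """First-order alignment: squared distance between the feature means."""
    diff = real.mean(axis=0) - syn.mean(axis=0)
    return float(diff @ diff)

def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))

def mmd2(real, syn, bandwidth=4.0):
    """Biased estimate of the squared MMD under a Gaussian kernel (assumed kernel choice)."""
    k_rr = rbf_kernel(real, real, bandwidth).mean()
    k_ss = rbf_kernel(syn, syn, bandwidth).mean()
    k_rs = rbf_kernel(real, syn, bandwidth).mean()
    return float(k_rr + k_ss - 2.0 * k_rs)

# Toy "representations": 500 examples with 8 features each (illustrative only).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))

# A synthetic set with exactly the same mean as `real` but three times the spread.
noise = rng.normal(0.0, 1.0, size=(500, 8))
syn_wide = real.mean(axis=0) + 3.0 * (noise - noise.mean(axis=0))

# A synthetic set drawn from the same distribution as `real`.
syn_close = rng.normal(0.0, 1.0, size=(500, 8))

print("mean-matching loss, wide set:", mean_matching_loss(real, syn_wide))
print("MMD^2, wide set: ", mmd2(real, syn_wide))
print("MMD^2, close set:", mmd2(real, syn_close))
```

In this toy setting the mean-matching loss is essentially zero for `syn_wide` even though its variance differs from that of `real`, while the kernel-based discrepancy separates the two sets. A Gaussian kernel is used here because its feature map implicitly involves moments of all orders; the kernel and estimator actually used by the method may differ.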