Data-efficient learning has garnered significant attention, especially given the current trend of large multi-modal models. Recently, dataset distillation has become an effective approach by synthesizing data samples that are essential for network training. However, which samples are essential for the dataset distillation process itself remains underexplored. In this work, we study data efficiency and selection for the dataset distillation task. By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy of the real dataset, both theoretically and empirically. We propose to use the empirical loss value as a static data pruning criterion. To further compensate for the variation of data value during training, we identify the most contributing samples based on their causal effects on the distillation. The proposed selection strategy can efficiently exploit the training dataset, outperform previous SOTA distillation algorithms, and consistently enhance existing distillation algorithms, even on much larger-scale and more heterogeneous datasets, e.g., full ImageNet-1K and Kinetics-400. We believe this paradigm will open up new avenues into the dynamics of distillation and pave the way for efficient dataset distillation. Our code is available at https://github.com/silicx/GoldFromOres-BiLP.
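To make the static criterion concrete, the following is a minimal sketch (not the authors' released implementation) of loss-based data pruning: per-sample empirical losses from a trained or warm-started model are ranked, and only a fraction of the samples is retained. The function name, the `keep_ratio` parameter, and the choice of which end of the ranking to keep are illustrative assumptions; the appropriate direction depends on the distillation setup.

```python
import numpy as np

def prune_by_loss(losses, keep_ratio=0.5, keep_low=True):
    """Static data pruning sketch: rank samples by empirical loss
    and keep a fraction of them.

    losses     : per-sample loss values from a trained/warm model
    keep_ratio : fraction of the dataset to retain (assumed knob)
    keep_low   : keep low-loss samples if True, high-loss otherwise
                 (which end is kept is setup-dependent)
    Returns the sorted indices of the retained samples.
    """
    order = np.argsort(losses)            # indices, ascending loss
    if not keep_low:
        order = order[::-1]               # descending loss instead
    n_keep = max(1, int(len(losses) * keep_ratio))
    return np.sort(order[:n_keep])        # stable index order

# Toy usage: keep the half of the dataset with the lowest loss.
losses = np.array([0.9, 0.1, 0.5, 0.3, 2.0, 0.05])
print(prune_by_loss(losses, keep_ratio=0.5))  # -> [1 3 5]
```

The distillation algorithm would then be run only on the retained subset; the causal-effect criterion described above refines this static selection during training.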