Data-efficient learning has garnered significant attention, especially given the current trend toward large multi-modal models. Recently, dataset distillation has become an effective approach to data efficiency; however, the distillation process itself can still be inefficient. In this work, we model the dataset distillation task within the context of information transport. Observing the substantial data redundancy inherent in distillation, we argue for placing more emphasis on the samples' utility for the distillation task. We introduce and validate a family of data utility estimators and optimal data selection methods to exploit the most valuable samples. This strategy significantly reduces training costs and extends various existing distillation algorithms to larger and more diverse datasets; in some cases, only 0.04% of the training data is sufficient for comparable distillation performance. Our method consistently enhances the distillation algorithms, even on much larger-scale and more heterogeneous datasets, e.g., ImageNet-1K and Kinetics-400. This paradigm opens up new avenues for studying the dynamics of distillation and paves the way for efficient dataset distillation. Our code is available at https://github.com/silicx/GoldFromOres.