Deep learning has advanced at an unprecedented pace over the last decade and has become the primary choice in many application domains. This progress is mainly attributed to a synergy in which rapidly growing computing resources enable advanced algorithms to process massive data. However, it has gradually become challenging to handle the unbounded growth of data with limited computing power. To this end, diverse approaches have been proposed to improve data processing efficiency. Dataset distillation, a dataset reduction method, addresses this problem by synthesizing a small, representative dataset from a large one, and it has attracted much attention from the deep learning community. Existing dataset distillation methods can be categorized into meta-learning and data matching frameworks according to whether they explicitly mimic the performance of the target data. Although dataset distillation has shown surprising performance in compressing datasets, several limitations remain, such as the difficulty of distilling high-resolution data. This paper provides a holistic understanding of dataset distillation from multiple aspects, including distillation frameworks and algorithms, factorized dataset distillation, performance comparison, and applications. Finally, we discuss challenges and promising directions to further promote future studies on dataset distillation.