Dataset distillation is attracting growing attention in machine learning as training sets continue to grow and the cost of training state-of-the-art models keeps rising. By synthesizing datasets with high information density, dataset distillation offers a range of potential applications, including support for continual learning, neural architecture search, and privacy protection. Despite recent advances, we lack a holistic understanding of the approaches and applications. Our survey aims to bridge this gap by first proposing a taxonomy of dataset distillation and characterizing existing approaches, and then systematically reviewing the data modalities and related applications. In addition, we summarize the challenges and discuss future directions for this field of research.