Dataset distillation is attracting growing attention in machine learning as training sets continue to grow and the cost of training state-of-the-art models rises. By synthesizing datasets with high information density, dataset distillation offers a range of potential applications, including support for continual learning, neural architecture search, and privacy protection. Despite recent advances, a holistic understanding of its approaches and applications is still lacking. This survey aims to bridge this gap by first proposing a taxonomy of dataset distillation and characterizing existing approaches, and then systematically reviewing the data modalities and related applications. In addition, we summarize the challenges and discuss future directions for this field of research.