The sharp rise in data-related costs has motivated research into condensing datasets while retaining their most informative features. Dataset distillation has thus recently come to the fore: this paradigm generates synthetic datasets representative enough to replace the original dataset when training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contain unique features and remain diverse from the others during synthesis. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve parallelizable yet isolated synthesis approaches. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments on multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense. Our code is available at https://github.com/AngusDujw/Diversity-Driven-Synthesis.
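To make the batch-to-subset matching idea concrete, the following is a minimal toy sketch, not the paper's actual algorithm: at each step a synthetic batch is pulled toward the statistics of a freshly re-sampled, large subset of the real data, so the optimization target varies over iterations and the synthetic data tracks different regions of the dataset. All sizes, the learning rate, and the use of simple mean matching are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: 1000 samples, 16 features (stand-in for image features).
real = rng.normal(loc=1.5, scale=0.5, size=(1000, 16))

# Synthetic batch to optimize: 32 samples, randomly initialized.
syn = rng.normal(size=(32, 16))

lr = 0.5
for step in range(200):
    # Re-sample a large, varying real subset each step so the synthetic
    # batch is matched against a changing slice of the original data.
    subset = real[rng.choice(len(real), size=256, replace=False)]
    target_mean = subset.mean(axis=0)
    # Gradient of ||mean(syn) - target_mean||^2 with respect to syn
    # (each synthetic sample receives the same pull on the batch mean).
    grad = 2.0 * (syn.mean(axis=0) - target_mean) / len(syn)
    syn -= lr * grad

# After optimization, the synthetic batch mean tracks the real data mean.
print(np.allclose(syn.mean(axis=0), real.mean(axis=0), atol=0.1))
```

Because the matched subset changes every iteration, no single fixed target dominates; a full method would match richer statistics (e.g., per-layer network features) rather than raw means.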