The sharp rise in data-related costs has motivated research into condensing datasets while retaining their most informative features. Dataset distillation has thus recently come to the fore: this paradigm generates synthetic datasets representative enough to replace the original dataset when training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contain unique features and remain diverse from the others during synthesis. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve parallelizable yet isolated synthesis approaches. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments on multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense. Our code is available at https://github.com/AngusDujw/Diversity-Driven-Synthesis.
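To make the batch-to-subset matching idea concrete, the following is a minimal toy sketch, not the paper's actual algorithm: at each step a synthetic batch is pulled toward the statistics of a freshly re-sampled, large subset of the real data, so the optimization target varies over iterations and the synthetic data tracks different regions of the dataset. All sizes, the learning rate, and the use of simple mean matching are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset: 1000 samples, 16 features (stand-in for image features).
real = rng.normal(loc=1.5, scale=0.5, size=(1000, 16))

# Synthetic batch to optimize: 32 samples, randomly initialized.
syn = rng.normal(size=(32, 16))

lr = 0.5
for step in range(200):
    # Re-sample a large, varying real subset each step so the synthetic
    # batch is matched against a changing slice of the original data.
    subset = real[rng.choice(len(real), size=256, replace=False)]
    target_mean = subset.mean(axis=0)
    # Gradient of ||mean(syn) - target_mean||^2 with respect to syn
    # (each synthetic sample receives the same pull on the batch mean).
    grad = 2.0 * (syn.mean(axis=0) - target_mean) / len(syn)
    syn -= lr * grad

# After optimization, the synthetic batch mean tracks the real data mean.
print(np.allclose(syn.mean(axis=0), real.mean(axis=0), atol=0.1))
```

Because the matched subset changes every iteration, no single fixed target dominates; a full method would match richer statistics (e.g., per-layer network features) rather than raw means.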