Dataset Distillation (DD) aims to synthesize a small dataset that performs comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains limited. In this paper, we take an initial step toward understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (images-per-class) settings. Based on our empirical observations and theoretical analysis, we introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples and thereby achieve higher dataset quality. SDC can be seamlessly integrated into existing methods as a plugin with minimal code adjustments. Experimental results demonstrate that adding SDC yields higher-quality distilled datasets across 7 distillation methods and 6 datasets.
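The abstract's difficulty measure — the norm of the per-sample loss gradient — can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it scores samples for a logistic-regression surrogate (the function name `sample_difficulty` and the choice of model are assumptions for illustration); in the paper the gradient is taken with respect to a neural network's parameters.

```python
import numpy as np

def sample_difficulty(X, y, w):
    """Score each sample by the L2 norm of its log-loss gradient
    w.r.t. the parameters w of a logistic-regression surrogate.
    Larger norm = harder sample (illustrative stand-in for the
    network-gradient norm used in the paper)."""
    scores = []
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-xi @ w))  # predicted probability
        grad = (p - yi) * xi               # per-sample gradient of log-loss
        scores.append(np.linalg.norm(grad))
    return np.array(scores)

# Confidently correct samples get small scores (easy); misclassified
# ones get large scores (hard).
X = np.array([[5.0, 0.0], [-5.0, 0.0]])
w = np.array([1.0, 0.0])
easy = sample_difficulty(X, np.array([1.0, 0.0]), w)  # labels agree with w
hard = sample_difficulty(X, np.array([0.0, 1.0]), w)  # labels flipped
```

Ranking samples by such a score is what lets a distillation method bias its synthesis toward easier (low-norm) regions of the data, as SDC advocates for low-IPC settings.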