Dataset distillation (DD) aims to minimize the time and memory required to train deep neural networks on large datasets by creating a smaller synthetic dataset that achieves performance comparable to the full real dataset. However, current dataset distillation methods often produce synthetic datasets that are excessively difficult for networks to learn from, because a substantial amount of information from the original data is compressed through metrics that measure feature similarity, e.g., distribution matching (DM). In this work, we introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset and propose a novel method based on minimizing CMI. Specifically, we minimize the distillation loss while simultaneously constraining the class-aware complexity of the synthetic dataset by minimizing its empirical CMI, computed in the feature space of pre-trained networks. Through a thorough set of experiments, we show that our method serves as a general regularization technique for existing DD methods and improves both their performance and training efficiency.
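As a rough illustration of how a CMI regularizer of this kind might be computed (a minimal sketch, not the paper's implementation; the estimator, the function name `empirical_cmi`, and the weighting hyperparameter `lam` are assumptions), one common empirical estimate treats CMI as the average KL divergence between each sample's class-posterior and the centroid of the posteriors for its class:

```python
import torch
import torch.nn.functional as F

def empirical_cmi(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Empirical estimate of I(X; Y_hat | Y) from a batch of logits.

    For each class c, the centroid q_c is the mean softmax output over samples
    of class c; the estimate is the average KL(p(x) || q_c) over all samples.
    (A standard empirical CMI estimator; details may differ from the paper.)
    """
    probs = F.softmax(logits, dim=1)          # per-sample class posteriors p(y_hat | x)
    cmi = logits.new_zeros(())
    n = labels.numel()
    for c in labels.unique():
        p_c = probs[labels == c]              # posteriors of samples with label c
        q_c = p_c.mean(dim=0, keepdim=True)   # class-conditional centroid
        # KL(p || q_c), summed over the samples of this class
        kl = (p_c * (p_c.clamp_min(1e-12).log() - q_c.clamp_min(1e-12).log())).sum()
        cmi = cmi + kl
    return cmi / n

# Hypothetical usage inside a distillation step: add the CMI term as a
# regularizer on the synthetic data, weighted by a hyperparameter lam, e.g.
#   loss = dd_loss(syn_images, syn_labels) \
#          + lam * empirical_cmi(model(syn_images), syn_labels)
```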