Group Distributionally Robust Dataset Distillation with Risk Minimization

Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of a model trained on the synthetic dataset with those of a model trained on the original training dataset. However, the training dataset should be regarded as an auxiliary target: the training set is itself only an approximate substitute for the population distribution, and it is the population distribution that is ultimately of interest. Despite the popularity of DD, its relationship to generalization remains unexplored, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions of low population density? In this setting, the representativeness and coverage of the dataset matter more at inference than a guaranteed training error. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments.
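
As a rough illustration of how clustering can be combined with a risk measure on the loss, the sketch below groups samples by nearest centroid, averages the loss within each group, and then takes the mean of the worst groups. The abstract does not name the risk measure or the clustering procedure, so the use of CVaR, the nearest-centroid assignment, and all names here (`nearest_centroid`, `group_mean_losses`, `cvar`, `alpha`) are assumptions made for illustration only, not the authors' implementation.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def group_mean_losses(losses, cluster_ids):
    """Mean per-sample loss within each non-empty cluster."""
    buckets = {}
    for loss, k in zip(losses, cluster_ids):
        buckets.setdefault(k, []).append(loss)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

def cvar(values, alpha):
    """Mean of the worst ceil(alpha * n) values, for alpha in (0, 1].

    CVaR is an assumed choice of risk measure, not one stated in the abstract.
    """
    n_tail = max(1, math.ceil(alpha * len(values)))
    return sum(sorted(values)[-n_tail:]) / n_tail

# Toy data (hypothetical): five samples, three centroids, fixed per-sample losses.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
per_sample_loss = [0.2, 0.4, 1.0, 2.0, 3.0]

cluster_ids = [nearest_centroid(x, centroids) for x in features]
group_losses = group_mean_losses(per_sample_loss, cluster_ids)

# Plain average over groups versus the tail-focused objective.
average_risk = sum(group_losses.values()) / len(group_losses)
tail_risk = cvar(list(group_losses.values()), alpha=0.5)
```

With `alpha=1.0` the tail objective reduces to the plain average over groups, while smaller values of `alpha` put all the weight on the worst-performing groups, which is the sense in which such an objective favors coverage of sparsely populated regions.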

