Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of training the model on the synthetic dataset with those of training it on the original dataset. However, matching the training dataset should be regarded as an auxiliary goal, in the same sense that the training set is only an approximate substitute for the population distribution, which is the data of actual interest. Despite the popularity of DD, its relationship to generalization remains unexplored, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions of low population density? In this setting, the representativeness and coverage of the dataset matter more at inference than a guaranteed training error. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments.
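To make the idea of a risk measure over clustered losses concrete, the following is a minimal sketch under stated assumptions, not the algorithm of the paper. The abstract does not say which risk measure or clustering method is used; here we assume conditional value-at-risk (CVaR), approximated as the mean of the worst fraction of per-cluster losses, and we assume cluster labels are already given. The function names `cluster_losses` and `cvar`, the sample values, and the level `alpha` are all hypothetical.

```python
import numpy as np

def cluster_losses(losses, labels, n_clusters):
    """Mean loss within each cluster (an empty cluster gets 0)."""
    sums = np.bincount(labels, weights=losses, minlength=n_clusters)
    counts = np.bincount(labels, minlength=n_clusters)
    return sums / np.maximum(counts, 1)

def cvar(values, alpha):
    """Discrete CVaR approximation: mean of the worst ceil(alpha * n) values.

    alpha in (0, 1]; alpha = 1 recovers the plain mean.
    """
    values = np.sort(np.asarray(values, dtype=float))[::-1]  # largest first
    k = max(1, int(np.ceil(alpha * len(values))))
    return values[:k].mean()

# Hypothetical per-sample losses and cluster assignments (illustrative only).
losses = np.array([0.1, 0.2, 0.1, 0.9, 1.1, 0.3])
labels = np.array([0, 0, 0, 1, 1, 2])

per_cluster = cluster_losses(losses, labels, n_clusters=3)
robust_objective = cvar(per_cluster, alpha=0.5)   # emphasizes the worst clusters
average_objective = losses.mean()                 # standard empirical risk
```

In this toy example the robust objective exceeds the plain average because the small, high-loss cluster is no longer outvoted by the large, low-loss one, which is the kind of subgroup sensitivity the abstract motivates.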