Dataset distillation methods have demonstrated remarkable performance for neural networks trained with very limited training data. However, a significant challenge arises in the form of architecture overfitting: distilled training data synthesized with a specific network architecture (i.e., the training network) yields poor performance when used to train other network architectures (i.e., test networks). This paper addresses this issue and proposes a series of approaches, in both architecture design and training schemes, that can be adopted together to boost generalization performance across different network architectures on the distilled training data. We conduct extensive experiments to demonstrate the effectiveness and generality of our methods. In particular, across various scenarios involving different sizes of distilled data, our approaches achieve comparable or superior performance to existing methods when networks with larger capacities are trained on the distilled data.