Training neural networks on a large dataset incurs substantial computational cost. Dataset reduction selects or synthesizes data instances from the large dataset while minimizing the loss in generalization performance relative to training on the full dataset. Existing methods use a neural network during the reduction procedure, so the model parameters become an important factor in preserving performance after reduction. Building on this importance of the parameters, this paper introduces a new reduction objective, coined LCMat, which Matches the Loss Curvatures of the original dataset and the reduced dataset over a region of the model parameter space, rather than at a single parameter point. This objective makes the reduced dataset adapt better to a perturbed parameter region than exact point matching does. In particular, we identify the worst case of the loss curvature gap within a local parameter region, and we derive an implementable upper bound on this worst case through theoretical analysis. Our experiments on both coreset selection and condensation benchmarks show that LCMat achieves better generalization performance than existing baselines.
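To make the idea of a worst-case curvature gap over a local parameter region concrete, the following is a minimal toy sketch and not the LCMat objective or the bound derived in the paper. It assumes a linear model with squared-error loss, for which the Hessian is constant, and it assumes the gradients of the two datasets already agree at the current parameter point, so that the loss gap under a perturbation `eps` reduces to a quadratic form in the Hessian difference. The function names `mse_hessian` and `worst_case_curvature_gap`, the radius `rho`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def mse_hessian(X):
    """Hessian of 0.5 * mean((X @ theta - y) ** 2) with respect to theta."""
    return X.T @ X / X.shape[0]

def worst_case_curvature_gap(X_full, X_reduced, rho):
    """Largest |0.5 * eps^T (H_full - H_reduced) eps| over all ||eps|| <= rho.

    For a symmetric matrix D, the maximum of |eps^T D eps| on the ball of
    radius rho equals rho**2 times the largest absolute eigenvalue of D.
    """
    diff = mse_hessian(X_full) - mse_hessian(X_reduced)
    eigvals = np.linalg.eigvalsh(diff)  # eigenvalues of a symmetric matrix
    return 0.5 * rho ** 2 * np.max(np.abs(eigvals))

# Synthetic data (an assumption for illustration, not from the paper).
rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 5))

# A random subset of 20 rows stands in for a reduced dataset.
X_subset = X_full[rng.choice(1000, size=20, replace=False)]

print("gap, full vs. itself:", worst_case_curvature_gap(X_full, X_full, rho=0.1))
print("gap, full vs. subset:", worst_case_curvature_gap(X_full, X_subset, rho=0.1))
```

In this toy setting a reduced dataset with a smaller gap behaves more like the full dataset everywhere in the ball of radius `rho` around the current parameters, not only at the point itself, which is the intuition behind matching over a region instead of a single point.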