Many recent works on understanding deep learning try to quantify how much individual data instances influence the optimization and generalization of a model. Such attempts reveal the characteristics and importance of individual instances, which may provide useful information for diagnosing and improving deep learning. However, most existing works on data valuation require actual training of a model, which often incurs a high computational cost. In this paper, we provide a training-free data valuation score, called the complexity-gap score, which is a data-centric score that quantifies the influence of individual instances on the generalization of two-layer overparameterized neural networks. The proposed score can quantify the irregularity of instances and measure how much each data instance contributes to the total movement of the network parameters during training. We theoretically analyze and empirically demonstrate the effectiveness of the complexity-gap score in finding "irregular or mislabeled" data instances, and also provide applications of the score in analyzing datasets and diagnosing training dynamics. Our code is publicly available at https://github.com/JJchy/CG_score
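The abstract does not state the formula for the score, so the following is only a minimal sketch under stated assumptions, not the code from the repository above. It assumes that the complexity measure is the quantity y^T H^{-1} y, where H is the Gram matrix of an infinite-width two-layer ReLU network on unit-norm inputs, and that the score of instance i is the drop in this measure when instance i is removed from the dataset. The function names (`ntk_gram`, `complexity`, `cg_scores`), the binary ±1 labels, and the small `ridge` term added for numerical stability are all hypothetical choices made for illustration.

```python
import numpy as np


def ntk_gram(X):
    """Gram matrix of an infinite-width two-layer ReLU network (assumed form).

    Rows of X are rescaled to unit norm, then
    H_ij = g_ij * (pi - arccos(g_ij)) / (2 * pi), with g_ij = <x_i, x_j>.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(X @ X.T, -1.0, 1.0)  # guard arccos against rounding error
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)


def complexity(H, y, ridge=1e-8):
    """Assumed complexity measure y^T (H + ridge * I)^{-1} y."""
    H = H + ridge * np.eye(len(y))
    return float(y @ np.linalg.solve(H, y))


def cg_scores(X, y, ridge=1e-8):
    """Leave-one-out gap in the complexity measure for every instance.

    No model is trained: only the Gram matrix and the labels are used.
    """
    H = ntk_gram(X)
    full = complexity(H, y, ridge)
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops instance i
        scores[i] = full - complexity(H[np.ix_(keep, keep)], y[keep], ridge)
    return scores
```

Under these assumptions, `cg_scores(X, y)` returns one nonnegative value per instance, and larger values mark instances whose removal most reduces the complexity measure. The loop solves one linear system per instance for clarity; it is not meant to be efficient on large datasets.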