Assessing the importance of individual training samples is a key challenge in machine learning. Traditional approaches retrain models with and without specific samples, which is computationally expensive and ignores dependencies between data points. We introduce LossVal, an efficient data valuation method that computes importance scores during neural network training by embedding a self-weighting mechanism into loss functions such as cross-entropy and mean squared error. LossVal reduces computational cost, making it suitable for large datasets and practical applications. Experiments on classification and regression tasks across multiple datasets show that LossVal effectively identifies noisy samples and distinguishes helpful from harmful ones. We examine the gradient calculation of LossVal to highlight its advantages. The source code is available at: https://github.com/twibiral/LossVal
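To make the self-weighting idea concrete, below is a minimal NumPy sketch of one possible mechanism: per-sample weights (a softmax over learnable logits) multiply the per-sample loss, and gradient descent on the weights themselves pushes weight away from high-loss samples. This is an illustrative toy, not the paper's exact LossVal formulation; the loss values, learning rate, and the choice of a softmax parameterization are all assumptions for the example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Hypothetical fixed per-sample cross-entropy losses for a toy batch of 4;
# sample 3 is assumed to be noisy/mislabeled, so its loss stays high.
ce = np.array([0.2, 0.3, 0.25, 2.5])

v = np.zeros(4)   # weight logits; softmax(v) starts uniform
lr = 0.5
for _ in range(200):
    w = softmax(v)
    total = np.sum(w * ce)        # self-weighted loss
    # d total / d v_j = w_j * (ce_j - total)  (softmax chain rule)
    grad_v = w * (ce - total)
    v -= lr * grad_v              # descend on the weights as well

w = softmax(v)
print(w)  # the assumed-noisy sample 3 ends up with the smallest weight
```

In a real training loop the model parameters would be updated jointly with these weights, and the learned per-sample weights then serve as importance scores: samples whose weight is driven down are flagged as harmful or noisy.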