Data valuation is a growing research field that studies the influence of individual data points on machine learning (ML) models. Data Shapley, inspired by cooperative game theory and economics, is an effective method for data valuation. However, it is well known that the Shapley value (SV) can be computationally expensive. Fortunately, Jia et al. (2019) showed that for K-Nearest Neighbors (KNN) models, the computation of Data Shapley is surprisingly simple and efficient. In this note, we revisit the work of Jia et al. (2019) and propose a more natural and interpretable utility function that better reflects the performance of KNN models. We derive the corresponding calculation procedure for the Data Shapley of KNN classifiers and regressors with the new utility functions. Our new approach, dubbed soft-label KNN-SV, achieves the same time complexity as the original method. We further provide an efficient approximation algorithm for soft-label KNN-SV based on locality-sensitive hashing (LSH). Our experimental results demonstrate that soft-label KNN-SV outperforms the original method on most datasets in the task of mislabeled data detection, making it a better baseline for future work on data valuation.
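To make the cost of the Shapley value concrete, the following is a minimal brute-force sketch rather than the efficient procedure derived in this note. It assumes a soft-label style utility for a single test point: the fraction of the min(K, |S|) nearest training points in a subset S whose label matches the test label, with the empty set assigned utility 0. That utility, the function names `knn_utility` and `exact_shapley`, and the toy one-dimensional data are all illustrative assumptions and are not taken from the paper.

```python
from itertools import permutations
from math import factorial


def knn_utility(subset, points, labels, x_test, y_test, k):
    """Assumed utility: fraction of the min(k, |S|) nearest points in S
    whose label equals y_test (hypothetical form, for illustration only)."""
    if not subset:
        return 0.0  # assumption: the empty set has zero utility
    # Sort the subset by distance to the test point and keep the k nearest.
    nearest = sorted(subset, key=lambda i: abs(points[i] - x_test))[:k]
    return sum(labels[i] == y_test for i in nearest) / len(nearest)


def exact_shapley(points, labels, x_test, y_test, k):
    """Exact Shapley values by averaging marginal contributions over all
    n! orderings. Only feasible for very small n."""
    n = len(points)
    values = [0.0] * n
    for order in permutations(range(n)):
        coalition = []
        prev = 0.0
        for i in order:
            coalition.append(i)
            cur = knn_utility(coalition, points, labels, x_test, y_test, k)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / factorial(n) for v in values]


# Toy data: four 1-D training points; point 1 carries the "wrong" label
# relative to the test point and sits close to it.
points = [0.1, 0.2, 0.9, 1.5]
labels = [1, 0, 1, 0]
sv = exact_shapley(points, labels, x_test=0.0, y_test=1, k=2)
print(sv)
```

By the efficiency property of the Shapley value, the entries of `sv` sum to the utility of the full training set, which is 0.5 in this toy example. The loop visits all n! orderings, which is why closed-form or approximate procedures matter once n grows beyond a handful of points.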