This work addresses an open problem in the data valuation literature: the efficient computation of Data Shapley for the weighted $K$-nearest-neighbor algorithm (WKNN-Shapley). Taking the accuracy of hard-label KNN with discretized weights as the utility function, we reframe the computation of WKNN-Shapley as a counting problem and introduce a quadratic-time algorithm, a notable improvement over $O(N^K)$, the best result in the existing literature. We also develop a deterministic approximation algorithm that further improves computational efficiency while maintaining the key fairness properties of the Shapley value. Through extensive experiments, we demonstrate WKNN-Shapley's computational efficiency and its superior performance in discerning data quality compared to its unweighted counterpart.
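To make the quantity being computed concrete, the following is a minimal brute-force sketch of Data Shapley under a hard-label weighted KNN utility. It is not the quadratic-time counting algorithm described in the abstract: it enumerates every subset and is exponential in $N$, so it is only usable on toy data. The one-dimensional features, the single validation point, the linear weight function `discretized_weight`, its parameters `n_bins` and `max_dist`, and the tie-breaking rule are all illustrative assumptions rather than details taken from the paper.

```python
from itertools import combinations
from math import factorial


def discretized_weight(dist, n_bins=4, max_dist=4.0):
    """Hypothetical weight: decreases linearly with distance, then is
    snapped to one of n_bins + 1 evenly spaced levels in [0, 1]."""
    raw = max(0.0, 1.0 - dist / max_dist)
    return round(raw * n_bins) / n_bins


def utility(subset, train, x_val, y_val, k):
    """Accuracy (0 or 1) of hard-label weighted KNN on one validation point.

    `subset` is a tuple of indices into `train`, a list of (x, label) pairs.
    The empty set is assigned utility 0 (an assumption of this sketch).
    """
    if not subset:
        return 0.0
    # K nearest neighbours within the subset; ties broken by index.
    nearest = sorted(subset, key=lambda i: (abs(train[i][0] - x_val), i))[:k]
    votes = {}
    for i in nearest:
        x, y = train[i]
        votes[y] = votes.get(y, 0.0) + discretized_weight(abs(x - x_val))
    # Weighted majority vote; on a tie the smallest label wins (assumption).
    pred = max(sorted(votes), key=lambda label: votes[label])
    return 1.0 if pred == y_val else 0.0


def exact_shapley(train, x_val, y_val, k):
    """Shapley value of every training point by full subset enumeration."""
    n = len(train)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley coefficient |S|! (n - |S| - 1)! / n!
            coef = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                gain = (utility(s + (i,), train, x_val, y_val, k)
                        - utility(s, train, x_val, y_val, k))
                values[i] += coef * gain
    return values


if __name__ == "__main__":
    # Toy data: two correctly labelled points near the query, two with the other label.
    toy_train = [(0.4, 1), (1.0, 1), (1.6, 0), (3.0, 0)]
    for point, value in zip(toy_train, exact_shapley(toy_train, 0.0, 1, k=3)):
        print(point, round(value, 4))
```

In this toy setup the values sum to the utility of the full training set (the efficiency property of the Shapley value). The closest correctly labelled point receives a positive value, and the nearer of the two oppositely labelled points receives a negative one, which is the kind of signal used to discern data quality.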