Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, computing them directly requires time exponential in the number of data points. We address this challenge by developing efficient algorithms tailored to computing Banzhaf values for $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is \#P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that yields two exact algorithms: a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $n$ is the number of data points and $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that runs in $O(nk^2)$ time, that is, linear in $n$ for fixed $k$. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
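To make the quantity being computed concrete, the following minimal Python sketch evaluates the Banzhaf value $\phi_i = 2^{-(n-1)} \sum_{S \subseteq N \setminus \{i\}} [U(S \cup \{i\}) - U(S)]$ by brute-force enumeration and by simple Monte Carlo sampling. It is a reference illustration only, not the dynamic programming algorithms described above. The utility used here (the fraction of the $k$ votes that match the label of a single test point, with $U(\emptyset) = 0$) is an assumption made for the example, and the names `knn_utility`, `exact_banzhaf`, and `mc_banzhaf` are hypothetical.

```python
import itertools
import random


def knn_utility(subset, dists, labels, test_label, k):
    """Assumed utility: share of the k votes matching test_label.

    Only the (at most k) points of `subset` closest to the test point vote;
    the empty subset has utility 0.
    """
    nearest = sorted(subset, key=lambda i: dists[i])[:k]
    return sum(labels[i] == test_label for i in nearest) / k


def exact_banzhaf(dists, labels, test_label, k):
    """Brute-force Banzhaf values: average marginal gain over all subsets.

    Enumerates 2^(n-1) subsets per point, so it is only usable for tiny n.
    """
    n = len(dists)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for s in itertools.combinations(others, r):
                with_i = knn_utility(s + (i,), dists, labels, test_label, k)
                without_i = knn_utility(s, dists, labels, test_label, k)
                total += with_i - without_i
        values.append(total / 2 ** (n - 1))
    return values


def mc_banzhaf(dists, labels, test_label, k, num_samples, seed=0):
    """Monte Carlo estimate: include every other point with probability 1/2."""
    rng = random.Random(seed)
    n = len(dists)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        acc = 0.0
        for _ in range(num_samples):
            s = tuple(j for j in others if rng.random() < 0.5)
            acc += (knn_utility(s + (i,), dists, labels, test_label, k)
                    - knn_utility(s, dists, labels, test_label, k))
        values.append(acc / num_samples)
    return values


if __name__ == "__main__":
    # Toy data: three training points at distances 1, 2, 3 from the test point.
    dists, labels = [1.0, 2.0, 3.0], [1, 0, 1]
    print(exact_banzhaf(dists, labels, test_label=1, k=1))  # [0.75, -0.25, 0.25]
    print(mc_banzhaf(dists, labels, test_label=1, k=1, num_samples=2000))
```

In this toy instance the nearest correctly labeled point receives the largest value and the mislabeled point receives a negative one. The exponential cost of `exact_banzhaf` is the bottleneck that motivates exploiting $k$NN locality.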