The Shapley value provides a principled foundation for data valuation, but its exact computation is #P-hard due to the exponential coalition space. Existing acceleration methods remain global and ignore a structural property of modern predictors: for a given test instance, only a small subset of training points influences the prediction. We formalize this model-induced locality through support sets defined by the model's computational pathway (e.g., neighbors in KNN, leaves in trees, receptive fields in GNNs), showing that Shapley computation can be projected onto these supports without loss when locality is exact. This reframes Shapley evaluation as a structured data processing problem over overlapping support-induced subset families rather than exhaustive coalition enumeration. We prove that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, establishing an information-theoretic lower bound on retraining operations. Guided by this result, we propose LSMR (Local Shapley via Model Reuse), an optimal subset-centric algorithm that trains each influential subset exactly once via support mapping and pivot scheduling. For larger supports, we develop LSMR-A, a reuse-aware Monte Carlo estimator that remains unbiased and enjoys exponential concentration, whose runtime is determined by the number of distinct sampled subsets rather than the total number of draws. Experiments across multiple model families demonstrate substantial reductions in retraining operations and corresponding speedups while preserving high valuation fidelity.
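For concreteness, here is a minimal formalization of the projection claim; the notation is ours: $v_x$ denotes the utility of a coalition evaluated on test instance $x$, $D$ the training set, and $T(x) \subseteq D$ the model-induced support set. Under exact locality, $v_x(S) = v_x(S \cap T(x))$ for every coalition $S$, so the standard Shapley sum over $D$ collapses losslessly onto $T(x)$:

$$
\phi_i(x) \;=\; \sum_{S \subseteq T(x)\setminus\{i\}} \frac{|S|!\,\bigl(|T(x)|-|S|-1\bigr)!}{|T(x)|!}\,\Bigl(v_x\bigl(S\cup\{i\}\bigr)-v_x(S)\Bigr),
\qquad \phi_i(x)=0 \ \text{ for } i\notin T(x).
$$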
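To illustrate the reuse principle behind LSMR-A, the following is a hedged sketch, not the paper's implementation: `support`, `utility`, and the plain permutation sampler are our assumptions (the paper's pivot scheduling is not reproduced here). The key point it demonstrates is that memoizing utilities by distinct coalition makes the number of (re)training calls equal the number of distinct sampled subsets rather than the total number of draws:

```python
import random

def local_shapley_mc(support, utility, num_permutations=1000, seed=0):
    """Monte Carlo Shapley estimates for the points in `support` only.

    support : list of training-point indices that can influence the test
              prediction (e.g., the k nearest neighbors under a KNN model).
    utility : callable mapping a frozenset of indices to a real utility;
              assumed to require a (re)training run, hence the dominant cost.
    """
    rng = random.Random(seed)
    cache = {}  # frozenset -> utility value: the model-reuse table

    def u(coalition):
        key = frozenset(coalition)
        if key not in cache:        # each distinct subset is trained once
            cache[key] = utility(key)
        return cache[key]

    values = {i: 0.0 for i in support}
    for _ in range(num_permutations):
        order = list(support)
        rng.shuffle(order)
        prefix, prev = [], u(())
        for i in order:             # marginal contribution of i given its prefix
            prefix.append(i)
            cur = u(prefix)
            values[i] += cur - prev
            prev = cur
    # unbiased permutation estimator; retraining calls == len(cache)
    return {i: v / num_permutations for i, v in values.items()}, len(cache)
```

With a small support (e.g., $|T(x)| = k$ for a KNN model), the cache holds at most $2^k$ entries, so utility evaluations are bounded independently of how many permutations are drawn.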