The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark evaluation, we find that OddSHAP achieves state-of-the-art estimation accuracy.
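The central claim, that the Shapley value depends only on the odd component of the set function, can be checked numerically on a small game. The sketch below is not the OddSHAP estimator. It assumes the odd and even components are defined through complements, v_odd(S) = (v(S) - v(N \ S)) / 2 and v_even(S) = (v(S) + v(N \ S)) / 2, which matches the odd/even-degree split of a Fourier expansion under a ±1 encoding. The helper names (`shapley_values`, `odd_even_split`, `sign`) and the random game are illustrative choices, not taken from the paper.

```python
import itertools
import math
import random


def shapley_values(v, n):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (v[s | {i}] - v[s])
    return phi


def odd_even_split(v, n):
    """Assumed definition: split v by its behaviour under complementation."""
    full = frozenset(range(n))
    odd = {s: 0.5 * (v[s] - v[full - s]) for s in v}
    even = {s: 0.5 * (v[s] + v[full - s]) for s in v}
    return odd, even


n = 5
rng = random.Random(0)
full = frozenset(range(n))
coalitions = [frozenset(c) for k in range(n + 1)
              for c in itertools.combinations(range(n), k)]

# A random set function stands in for a model's value function.
v = {s: rng.gauss(0.0, 1.0) for s in coalitions}
v_odd, v_even = odd_even_split(v, n)

phi_full = shapley_values(v, n)
phi_odd = shapley_values(v_odd, n)
phi_even = shapley_values(v_even, n)

# Paired sampling: every sampled coalition is accompanied by its complement.
sampled = [rng.choice(coalitions) for _ in range(8)]
paired = sampled + [full - s for s in sampled]


def sign(s, i):
    """±1 encoding of membership of player i in coalition s."""
    return 1.0 if i in s else -1.0


# Unweighted inner product between each linear feature column and the
# even component, taken over the paired sample.
even_corr = [sum(sign(s, i) * v_even[s] for s in paired) for i in range(n)]

print("max |phi(v) - phi(v_odd)|:",
      max(abs(a - b) for a, b in zip(phi_full, phi_odd)))
print("max |phi(v_even)|:", max(abs(e) for e in phi_even))
print("max |<x_i, v_even>| on paired sample:",
      max(abs(c) for c in even_corr))
```

All three printed quantities should be zero up to floating-point rounding. The last one illustrates, in the unweighted case, why paired sampling filters out the even component in a linear regression: within each pair the ±1 feature flips sign while the even component keeps its value, so the two contributions cancel.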