Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions force a trade-off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP, which reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On the theoretical side, we derive a polynomial-time generalization of interventional TreeSHAP that computes exact interaction indices for tree ensembles, bypassing the exponential dependence on tree depth of prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state of the art in approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
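
The idea of pairing a proxy with a residual correction can be illustrated with the linearity of the Shapley value: for a model f and a proxy g, the attributions of f equal those of g plus those of the residual f - g. The sketch below is only an illustration under stated assumptions, not the ProxySHAP implementation. The toy game `f`, the proxy `g`, and all function names are hypothetical; the proxy is solved by brute-force enumeration rather than a TreeSHAP-style algorithm; and plain permutation sampling stands in for MSR as the residual estimator. It covers first-order Shapley values only, not interaction indices.

```python
import itertools
import math
import random

N = 4  # number of features (players) in the hypothetical toy game


def f(S):
    """Hypothetical 'expensive' model value for coalition S (a frozenset)."""
    base = sum(i + 1 for i in S)                # additive part
    pair = 2.0 if {0, 1} <= S else 0.0          # pairwise interaction
    triple = 0.5 if {1, 2, 3} <= S else 0.0     # term the proxy misses
    return base + pair + triple


def g(S):
    """Hypothetical cheap proxy: additive and pairwise terms only."""
    return sum(i + 1 for i in S) + (2.0 if {0, 1} <= S else 0.0)


def residual(S):
    """What the proxy fails to explain."""
    return f(S) - g(S)


def exact_shapley(v, n):
    """Exact Shapley values by enumerating all coalitions (tiny n only)."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for T in itertools.combinations(others, k):
                S = frozenset(T)
                phi[i] += w * (v(S | {i}) - v(S))
    return phi


def permutation_shapley(v, n, num_perms, rng):
    """Monte Carlo Shapley estimate from random permutations (not MSR)."""
    phi = [0.0] * n
    for _ in range(num_perms):
        order = list(range(n))
        rng.shuffle(order)
        S = frozenset()
        prev = v(S)
        for i in order:
            S = S | {i}
            cur = v(S)
            phi[i] += (cur - prev) / num_perms
            prev = cur
    return phi


rng = random.Random(0)
phi_proxy = exact_shapley(g, N)                       # exact on the proxy
phi_resid = permutation_shapley(residual, N, 200, rng)  # sampled correction
phi_corrected = [a + b for a, b in zip(phi_proxy, phi_resid)]
phi_true = exact_shapley(f, N)                        # reference, tiny n only
```

In this toy setting the residual is small and sparse, so the sampled correction only has to estimate the part the proxy missed. Comparing `phi_corrected` with `phi_true` shows how close the corrected estimate gets for a given number of permutations.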