In recent years, the Shapley value (SV), a solution concept from cooperative game theory, has found numerous applications in data analytics (DA). This paper provides the first comprehensive study of SV across the DA workflow, clarifying the key variables in defining a DA-applicable SV and the essential functionalities that SV can provide for data scientists. We distill four primary challenges of using SV in DA, namely computation efficiency, approximation error, privacy preservation, and interpretability; disentangle the techniques that existing work uses to address them; and analyze and discuss these techniques with respect to each challenge and the potential conflicts between challenges. We also implement SVBench, a modular and extensible open-source framework for developing SV applications in different DA tasks, and conduct extensive evaluations to validate our analyses and discussions. Based on the qualitative and quantitative results, we identify the limitations of current efforts to apply SV to DA and highlight directions for future research and engineering.
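To make the tension between computation efficiency and approximation error concrete, the following is a minimal sketch of the standard permutation-based definition of SV and a Monte Carlo estimate of it. It is not SVBench's API: the function names `exact_shapley` and `monte_carlo_shapley` and the three-player `toy_utility` are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random
from typing import Callable, FrozenSet, List

# A utility function maps a coalition of player indices to a real value.
Utility = Callable[[FrozenSet[int]], float]


def exact_shapley(n: int, utility: Utility) -> List[float]:
    """Exact SV: average each player's marginal contribution over all n! orderings."""
    values = [0.0] * n
    for order in itertools.permutations(range(n)):
        coalition: FrozenSet[int] = frozenset()
        prev = utility(coalition)
        for player in order:
            coalition = coalition | {player}
            cur = utility(coalition)
            values[player] += cur - prev
            prev = cur
    total = math.factorial(n)
    return [v / total for v in values]


def monte_carlo_shapley(
    n: int, utility: Utility, num_samples: int, seed: int = 0
) -> List[float]:
    """Approximate SV by averaging over randomly sampled orderings."""
    rng = random.Random(seed)
    values = [0.0] * n
    players = list(range(n))
    for _ in range(num_samples):
        rng.shuffle(players)
        coalition: FrozenSet[int] = frozenset()
        prev = utility(coalition)
        for player in players:
            coalition = coalition | {player}
            cur = utility(coalition)
            values[player] += cur - prev
            prev = cur
    return [v / num_samples for v in values]


def toy_utility(coalition: FrozenSet[int]) -> float:
    """Hypothetical utility: additive weights plus a bonus when players 0 and 1 cooperate."""
    weights = [1.0, 2.0, 3.0]
    bonus = 3.0 if {0, 1} <= coalition else 0.0
    return sum(weights[i] for i in coalition) + bonus


exact = exact_shapley(3, toy_utility)
approx = monte_carlo_shapley(3, toy_utility, num_samples=2000)
```

The exact routine evaluates the utility on the order of n · n! times, which is why it stops being practical beyond a handful of players. The sampled version trades that cost for estimation error that shrinks as `num_samples` grows. In this toy game the cooperation bonus is split equally between players 0 and 1, and the values sum to the utility of the full coalition.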