As data becomes a key driver of technological and economic progress, accurately quantifying its value in algorithmic decision-making is a central challenge. The Shapley value, a well-established concept from cooperative game theory, has been widely adopted to assess the contribution of individual data sources in supervised machine learning. However, its symmetry axiom assumes that all players in the cooperative game are homogeneous, which overlooks the complex structures and dependencies present in real-world datasets. To address this limitation, we extend the traditional data Shapley framework to asymmetric data Shapley, which is flexible enough to incorporate the inherent structure of a dataset for structure-aware data valuation. We also introduce an efficient $k$-nearest neighbor-based algorithm for its exact computation. We demonstrate the practical applicability of our framework across various machine learning tasks and data market contexts. The code is available at: https://github.com/xzheng01/Asymmetric-Data-Shapley.
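To make the role of the symmetry axiom concrete, the following is a minimal brute-force sketch, not the $k$-nearest neighbor-based algorithm mentioned above. It assumes one common way of relaxing symmetry: averaging marginal contributions only over orderings that respect a given precedence among classes of players. Whether this matches the framework's exact definition is an assumption, and the names `permutation_values`, `toy_utility`, and `class_of` are hypothetical, introduced only for illustration.

```python
from itertools import permutations


def permutation_values(players, utility, class_of=None):
    """Average each player's marginal contribution over orderings.

    With class_of=None, all orderings are used (classic Shapley value).
    With class_of given (player -> integer rank), only orderings in which
    ranks never decrease are used, so lower-ranked players always enter
    first. This restriction is an illustrative assumption.
    """
    totals = {p: 0.0 for p in players}
    n_orders = 0
    for order in permutations(players):
        if class_of is not None:
            ranks = [class_of[p] for p in order]
            if any(a > b for a, b in zip(ranks, ranks[1:])):
                continue  # ordering violates the class precedence
        coalition = frozenset()
        prev = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = utility(coalition)
            totals[p] += cur - prev  # marginal contribution of p
            prev = cur
        n_orders += 1
    return {p: t / n_orders for p, t in totals.items()}


def toy_utility(coalition):
    """Hypothetical utility: 'a' and 'b' are exact copies, 'c' is independent."""
    score = 1.0 if ("a" in coalition or "b" in coalition) else 0.0
    if "c" in coalition:
        score += 0.5
    return score


players = ["a", "b", "c"]
symmetric = permutation_values(players, toy_utility)
# Suppose 'a' is the original record and 'b' a later copy of it.
asymmetric = permutation_values(players, toy_utility, {"a": 0, "b": 1, "c": 1})
print(symmetric)
print(asymmetric)
```

In this toy game the symmetric value splits the shared credit equally between the duplicates `a` and `b`, whereas the precedence-restricted variant assigns all of it to `a`. In both cases the values sum to the utility of the full set. Enumerating orderings is factorial in the number of players, so this is only practical for very small examples.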