The Shapley value provides a principled framework for fairly distributing rewards among participants according to their individual contributions. Prior work has applied this concept to data valuation in machine learning, but existing formulations overwhelmingly assume that each participant contributes a fixed, deterministic dataset. In practice, data owners often provide samples drawn from underlying probability distributions, which makes their marginal contributions stochastic and the Shapley value itself a random variable. We address this gap by proposing a framework for the Shapley value of probabilistic data distributions that quantifies both the expectation and the variance of each participant's contribution, thereby capturing the uncertainty induced by random sampling. We develop theoretical and empirical methods for estimating these quantities. On the theoretical side, we derive unbiased estimators for the expectation and variance of the probabilistic Shapley value and analyze their statistical properties. On the empirical side, we introduce three Monte Carlo estimation algorithms: a baseline estimator that uses independent samples, a pooled estimator that improves efficiency through sample reuse, and a stratified pooled estimator that adaptively allocates the sampling budget according to player-specific variability. Experiments on synthetic and real datasets show that these methods achieve strong accuracy-efficiency trade-offs, with the stratified pooled approach attaining substantial variance reduction at minimal additional cost. By extending Shapley value analysis from deterministic datasets to probabilistic data distributions, this work provides both theoretical grounding and practical tools for fair and reliable data valuation in stochastic data-sharing environments.
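To make the idea of a baseline estimator with independent samples concrete, the following is a minimal Python sketch and not the algorithm from this work. It assumes a small number of players, so that exact Shapley values can be enumerated for each sampled game, and it takes the sample mean and the unbiased sample variance of each player's Shapley value across independent replicates. The names `exact_shapley`, `baseline_estimate`, `samplers`, and `utility_of_data`, as well as the additive toy utility, are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random
import statistics


def exact_shapley(n_players, utility):
    """Exact Shapley values for one fixed (already sampled) game.

    `utility` maps a tuple of player indices to a real number.
    Enumeration is exponential in n_players, so this only suits small games.
    """
    values = [0.0] * n_players
    n_fact = math.factorial(n_players)
    for i in range(n_players):
        others = [p for p in range(n_players) if p != i]
        for k in range(len(others) + 1):
            # Standard Shapley weight for coalitions of size k.
            weight = math.factorial(k) * math.factorial(n_players - k - 1) / n_fact
            for subset in itertools.combinations(others, k):
                values[i] += weight * (utility(subset + (i,)) - utility(subset))
    return values


def baseline_estimate(samplers, utility_of_data, n_reps, rng):
    """Sample mean and unbiased sample variance of each player's Shapley value.

    Each replicate draws a fresh dataset for every player (assumption: no
    sample reuse across replicates), then evaluates the resulting game.
    """
    n = len(samplers)
    draws = []
    for _ in range(n_reps):
        data = [sampler(rng) for sampler in samplers]

        def utility(coalition, data=data):
            return utility_of_data([data[p] for p in coalition])

        draws.append(exact_shapley(n, utility))
    means = [statistics.fmean([d[i] for d in draws]) for i in range(n)]
    variances = [statistics.variance([d[i] for d in draws]) for i in range(n)]
    return means, variances


if __name__ == "__main__":
    # Hypothetical toy setup: three players with different noise levels and
    # an additive utility (the total of all contributed values).
    rng = random.Random(0)
    samplers = [
        lambda r: [r.gauss(1.0, 0.1) for _ in range(5)],
        lambda r: [r.gauss(1.0, 1.0) for _ in range(5)],
        lambda r: [r.gauss(2.0, 0.5) for _ in range(5)],
    ]
    total = lambda datasets: sum(sum(d) for d in datasets)
    means, variances = baseline_estimate(samplers, total, n_reps=200, rng=rng)
    print(means, variances)
```

With an additive utility, each player's Shapley value reduces to the total of its own samples, which makes the toy case easy to check by hand. A pooled or stratified variant would change how samples are reused and how `n_reps` is split across players; that is not attempted here.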