As data plays an increasingly pivotal role in decision-making, the emergence of data markets underscores the growing importance of data valuation. In machine learning, Data Shapley is a widely adopted method for data valuation. However, Data Shapley assumes a fixed dataset, which contrasts with real-world applications where data constantly evolves and expands. This paper establishes the relationship between Data Shapley and infinite-order U-statistics and addresses this limitation by using U-statistics to quantify the uncertainty of Data Shapley under changes in the data distribution. We perform statistical inference on data valuation to obtain confidence intervals for the estimates. We construct two algorithms to estimate this uncertainty and recommend when each is applicable. We also conduct a series of experiments on various datasets to verify asymptotic normality and propose a practical trading scenario enabled by this method.
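For readers unfamiliar with the fixed-dataset quantity mentioned above, the following is a minimal sketch of the standard Data Shapley value: each point's marginal contribution to a utility function, averaged over all orderings of the dataset. It is not either of the paper's two algorithms and does not quantify uncertainty. The names `data_shapley`, `toy_utility`, and `points`, as well as the utility itself, are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def data_shapley(points, utility):
    """Exact Data Shapley values for a small, fixed dataset.

    Averages each point's marginal contribution over all n! orderings,
    so this is only feasible for very small n.
    """
    n = len(points)
    values = [0.0] * n
    for order in permutations(range(n)):
        subset = []
        prev = utility(subset)
        for i in order:
            subset.append(points[i])
            cur = utility(subset)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    total = factorial(n)
    return [v / total for v in values]


def toy_utility(subset):
    """Hypothetical utility: negative squared error of the subset mean
    against a target of 1.0. An empty subset is assumed to predict 0.0."""
    if not subset:
        return -1.0
    mean = sum(subset) / len(subset)
    return -(mean - 1.0) ** 2


# Three points near the target and one outlier (illustrative data only).
points = [0.9, 1.1, 1.0, 3.0]
values = data_shapley(points, toy_utility)
print(values)
```

In this toy setting the values sum to the utility of the full dataset minus the utility of the empty set, and the outlier receives a negative value. Because the values are computed for one fixed dataset, they would change if the points were redrawn, which is the kind of uncertainty the abstract sets out to quantify.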