Data valuation, in particular quantifying the value of data in algorithmic prediction and decision-making, is a fundamental problem in data trading. The most widely used approach defines the data Shapley value and approximates it with the permutation sampling algorithm. However, permutation sampling has a large estimation variance, which hinders the development of data marketplaces. To address this, we propose a more robust data valuation method based on stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify and how many samples to take from each stratum, and we analyze the sample complexity of VRDS. Finally, we illustrate the effectiveness of VRDS on different types of datasets and in data removal applications.
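To make the contrast between the two sampling schemes concrete, the following is a minimal sketch, not the VRDS algorithm itself. It assumes that strata are formed by coalition size and that every stratum receives the same number of samples; the abstract does not state either choice, and the allocation derived in the paper may differ. The function names, the toy `toy_utility`, and the weights are all hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations
from math import comb

# Hypothetical toy data: four points with integer "weights".
WEIGHTS = [1, 2, 3, 4]

def toy_utility(subset):
    """Illustrative utility: squared sum of the weights in the subset."""
    return sum(WEIGHTS[j] for j in subset) ** 2

def exact_shapley(n, utility, i):
    """Exact Shapley value of point i by enumerating every coalition."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for k in range(n):  # k = size of the coalition that i joins
        stratum_sum = 0.0
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            stratum_sum += utility(s | {i}) - utility(s)
        total += stratum_sum / comb(n - 1, k)  # mean marginal gain at size k
    return total / n

def permutation_shapley(n, utility, i, num_perms, rng):
    """Permutation sampling: average i's marginal gain over random orderings."""
    order = list(range(n))
    total = 0.0
    for _ in range(num_perms):
        rng.shuffle(order)
        before = frozenset(order[:order.index(i)])  # points preceding i
        total += utility(before | {i}) - utility(before)
    return total / num_perms

def stratified_shapley(n, utility, i, samples_per_stratum, rng):
    """Stratified sampling with one stratum per coalition size (assumed).

    Equal allocation across strata is a placeholder, not a derived optimum.
    """
    others = [j for j in range(n) if j != i]
    total = 0.0
    for k in range(n):
        acc = 0.0
        for _ in range(samples_per_stratum):
            s = frozenset(rng.sample(others, k))  # uniform coalition of size k
            acc += utility(s | {i}) - utility(s)
        total += acc / samples_per_stratum
    return total / n

if __name__ == "__main__":
    rng = random.Random(0)
    n, i = len(WEIGHTS), 2
    print("exact      :", exact_shapley(n, toy_utility, i))
    print("permutation:", permutation_shapley(n, toy_utility, i, 800, rng))
    print("stratified :", stratified_shapley(n, toy_utility, i, 200, rng))
```

Both estimators use the same total number of utility-difference evaluations here (800), so repeating the run over many seeds gives a rough empirical comparison of their variances on this toy utility.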