Data valuation has found various applications in machine learning, such as data filtering, efficient learning, and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. Despite its popularity, the Shapley value is computationally expensive even to approximate, because it requires repeatedly training models on different subsets of the data. In this paper, we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings, including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that models trained on small subsets are in fact more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley, a strategy that uses only small subsets for the approximation. Experiments show that this approach preserves the approximate value and rank of data while achieving a speedup of up to 9.9x. In pre-trained networks, the approach is found to bring further efficiency gains, yielding accurate evaluation from small subsets.
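To make the small-subset idea concrete, the following is a minimal sketch and not the paper's $\delta$-Shapley estimator: it averages each point's marginal contribution over random subsets whose size stays below a cap, rather than over subsets of every size. The function names, the uniform weighting across subset sizes, the nearest-centroid utility, the toy data, and the chance-level baseline of 0.5 for the empty subset are all assumptions made for illustration.

```python
import numpy as np

def small_subset_values(utility, n, max_size, samples_per_size, seed=0):
    """Average marginal contribution of each point over random subsets
    of size 0..max_size-1 (hypothetical helper, uniform over sizes)."""
    if max_size > n:
        raise ValueError("max_size must not exceed the number of points")
    rng = np.random.default_rng(seed)
    values = np.zeros(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)  # candidate members of S
        total = 0.0
        for k in range(max_size):
            for _ in range(samples_per_size):
                S = rng.choice(others, size=k, replace=False)
                # Marginal contribution of point i to subset S.
                total += utility(np.append(S, i)) - utility(S)
        values[i] = total / (max_size * samples_per_size)
    return values

def make_centroid_utility(X, y, X_val, y_val):
    """Utility = validation accuracy of a nearest-centroid classifier
    trained on the selected rows (an assumed stand-in for model training)."""
    def utility(idx):
        if len(idx) == 0:
            return 0.5  # assumption: chance accuracy on balanced binary labels
        Xs, ys = X[idx], y[idx]
        classes = np.unique(ys)
        centroids = np.stack([Xs[ys == c].mean(axis=0) for c in classes])
        dists = ((X_val[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[dists.argmin(axis=1)]
        return float((pred == y_val).mean())
    return utility

# Toy data: two Gaussian clusters, one per class.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(-2, 1, (10, 2)), data_rng.normal(2, 1, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
X_val = np.vstack([data_rng.normal(-2, 1, (50, 2)), data_rng.normal(2, 1, (50, 2))])
y_val = np.array([0] * 50 + [1] * 50)

values = small_subset_values(
    make_centroid_utility(X, y, X_val, y_val),
    n=len(X), max_size=4, samples_per_size=20,
)
print(np.round(values, 3))
```

Because every sampled subset has fewer than `max_size` points, each utility call trains on only a handful of examples, which is where the cost saving in this sketch comes from. How subset sizes should be weighted, and how large the cap can be while keeping the guarantees, is what the paper's analysis addresses and is not reproduced here.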