Data valuation has become an increasingly important discipline in data science because of the economic value of data. In machine learning (ML), data valuation methods aim to measure equitably how much each data point contributes to the utility of an ML model. A prevalent method is the Shapley value, which helps identify data points that are beneficial or detrimental to an ML model. However, traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points for probabilistic classifiers. In this paper, we propose the Probabilistic Shapley (P-Shapley) value, which is built on a probability-wise utility function that uses the predicted class probabilities of probabilistic classifiers rather than the binarized prediction results used by the traditional Shapley value. We also offer several activation functions for confidence calibration to quantify more effectively the marginal contribution of each data point to a probabilistic classifier. Extensive experiments on four real-world datasets demonstrate the effectiveness of the proposed P-Shapley value in evaluating the importance of data for building a highly usable and trustworthy ML model.
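To make the contrast between a binarized utility and a probability-wise utility concrete, the following is a minimal sketch and not the paper's implementation. It rests on several assumptions the abstract does not confirm: the probability-wise utility is taken to be the mean predicted probability of the true class on a validation set, the optional `activation` argument stands in for the calibration functions (the abstract does not name them, so the identity is used), and `predict_proba` is a hypothetical toy kernel classifier on 1-D inputs. Shapley values are computed exactly by enumerating permutations, which is feasible only for very small training sets.

```python
from itertools import permutations
from math import exp


def predict_proba(train, x):
    """Hypothetical toy classifier: kernel-weighted vote over 1-D points.

    Returns an estimate of P(y = 1 | x); with no training data it returns 0.5.
    """
    weight = [0.0, 0.0]
    for xi, yi in train:
        weight[yi] += exp(-(x - xi) ** 2)
    total = weight[0] + weight[1]
    return 0.5 if total == 0 else weight[1] / total


def accuracy_utility(train, val):
    """Binarized utility: fraction of validation points classified correctly."""
    correct = 0
    for x, y in val:
        pred = 1 if predict_proba(train, x) > 0.5 else 0
        correct += int(pred == y)
    return correct / len(val)


def probability_utility(train, val, activation=lambda p: p):
    """Assumed probability-wise utility: mean (activated) true-class probability."""
    total = 0.0
    for x, y in val:
        p1 = predict_proba(train, x)
        total += activation(p1 if y == 1 else 1.0 - p1)
    return total / len(val)


def shapley_values(train, val, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(train)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        subset = []
        prev = utility(subset, val)
        for i in order:
            subset = subset + [train[i]]
            cur = utility(subset, val)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / len(orderings) for v in values]


# Illustrative data only: (feature, label) pairs. The last training point
# carries a label that disagrees with its nearest neighbours.
train = [(0.0, 0), (0.2, 0), (2.0, 1), (2.2, 1), (1.9, 0)]
val = [(0.1, 0), (0.3, 0), (1.8, 1), (2.1, 1)]

acc_values = shapley_values(train, val, accuracy_utility)
prob_values = shapley_values(train, val, probability_utility)

for point, a, p in zip(train, acc_values, prob_values):
    print(point, "accuracy-based:", round(a, 4), "probability-based:", round(p, 4))
```

Both utilities go through the same `shapley_values` routine, so any difference in the printed values comes only from the choice of utility. A different calibration function can be tried by wrapping the utility, for example `lambda t, v: probability_utility(t, v, activation=some_function)`, where `some_function` is whatever activation is under study.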