Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been recognized as infeasible. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are $10^6$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has a solid theoretical interpretation: when two different points are compared, it identifies the same important data point as the infinitesimal jackknife influence function. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
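The out-of-bag idea described in the abstract can be illustrated with a small sketch. It assumes one natural reading of the method: each data point is scored by how often the weak learners that did not see it in their bootstrap sample predict its label correctly. The function names (`fit_centroids`, `predict_centroid`, `data_oob_values`), the nearest-centroid weak learner, the 0/1 correctness score, and the toy data are all illustrative assumptions, not taken from the source.

```python
import random

def fit_centroids(X, y):
    """Hypothetical weak learner: per-class mean vectors (nearest centroid)."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_centroid(model, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

def data_oob_values(X, y, n_learners=100, seed=0):
    """Score each point by its average out-of-bag correctness.

    Assumption: value_i = mean over learners that did NOT train on point i
    of 1[prediction == y_i]. Every weak learner is trained once and reused
    to score all of its out-of-bag points.
    """
    rng = random.Random(seed)
    n = len(X)
    score_sum = [0.0] * n
    oob_count = [0] * n
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        model = fit_centroids([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # point i is out-of-bag for this learner
                oob_count[i] += 1
                if predict_centroid(model, X[i]) == y[i]:
                    score_sum[i] += 1.0
    # A point that was never out-of-bag gets NaN (no estimate available).
    return [s / c if c else float("nan") for s, c in zip(score_sum, oob_count)]

# Toy data: two well-separated clusters with four deliberately flipped labels.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)] + \
    [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(40)]
y = [0] * 40 + [1] * 40
flipped = [0, 1, 40, 41]
for i in flipped:
    y[i] = 1 - y[i]

values = data_oob_values(X, y, n_learners=100)
lowest = sorted(range(len(values)), key=lambda i: values[i])[:4]
print("lowest-valued indices:", sorted(lowest))
```

In this toy setting the flipped labels disagree with almost every out-of-bag prediction, so they receive the lowest values, which mirrors the mislabeled-data detection task mentioned in the abstract. The cost is one training run per weak learner rather than one per data subset, which is the source of the efficiency the abstract claims.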