Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, they have been considered infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU when there are $10^6$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has a solid theoretical interpretation: when two different data points are compared, it identifies the same point as more important as the infinitesimal jackknife influence function does. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
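The idea of valuing data through the out-of-bag estimate of a bagging model can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `oob_data_values`, the nearest-centroid weak learner, the 0/1 correctness score, and the toy data are all assumptions made for illustration. Each point's value is taken to be the average correctness of the weak learners whose bootstrap sample did not contain that point, so the learners fitted for the bagging model are reused and nothing is retrained per point.

```python
import random


def fit_centroids(xs, ys):
    """Hypothetical weak learner: one centroid per class label (1-D inputs)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_centroids(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def oob_data_values(xs, ys, n_bags=200, seed=0):
    """Average out-of-bag correctness per point (illustrative score choice)."""
    rng = random.Random(seed)
    n = len(xs)
    n_classes = len(set(ys))
    correct = [0] * n    # number of OOB learners that predicted y_i correctly
    oob_count = [0] * n  # number of learners for which point i was out-of-bag
    fitted = 0
    while fitted < n_bags:
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        if len({ys[i] for i in bag}) < n_classes:
            continue  # assumption of this sketch: redraw bags missing a class
        fitted += 1
        model = fit_centroids([xs[i] for i in bag], [ys[i] for i in bag])
        in_bag = set(bag)
        for i in range(n):
            if i not in in_bag:
                oob_count[i] += 1
                correct[i] += int(predict_centroids(model, xs[i]) == ys[i])
    # A point that was never out-of-bag has no estimate; report NaN for it.
    return [c / k if k else float("nan") for c, k in zip(correct, oob_count)]


# Toy data: two well-separated clusters, with one label flipped on purpose.
xs = [float(v) for v in range(10)] + [float(v) for v in range(100, 110)]
ys = [0] * 10 + [1] * 10
flipped = 5
ys[flipped] = 1  # x = 5.0 sits in the class-0 cluster but carries label 1

values = oob_data_values(xs, ys)
ranking = sorted(range(len(xs)), key=lambda i: values[i])
print("lowest-valued point:", ranking[0], "value:", values[ranking[0]])
```

In this toy setting, every learner that did not see the flipped point predicts its cluster's label rather than its recorded label, so the point receives the lowest value, which mirrors the mislabeled-data detection task mentioned in the abstract. A real experiment would use a stronger weak learner (for example, decision trees) and a score suited to the task.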