Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks; however, they are well known to be computationally challenging because they require training a large number of models. As a result, applying them to large datasets has been recognized as infeasible. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data points by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU when there are $10^6$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has a solid theoretical interpretation in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of samples. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.
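To make the out-of-bag idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a point's value is the average correctness of the weak learners whose bootstrap sample did not contain that point, so every learner is trained once and reused for all of its out-of-bag points. The nearest-centroid weak learner, the synthetic two-blob dataset, and all function names (`fit_centroids`, `predict_centroids`, `oob_values`) are illustrative assumptions chosen to keep the example dependency-free.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Hypothetical weak learner: one mean vector per class."""
    centroids = np.full((n_classes, X.shape[1]), np.inf)
    for c in range(n_classes):
        members = X[y == c]
        if len(members) > 0:  # a class may be missing from a bootstrap sample
            centroids[c] = members.mean(axis=0)
    return centroids


def predict_centroids(centroids, X):
    """Assign each row of X to the class with the nearest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def oob_values(X, y, n_learners=200, seed=0):
    """Average out-of-bag correctness per training point (assumed scoring rule)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_classes = int(y.max()) + 1
    score_sum = np.zeros(n)
    oob_count = np.zeros(n)
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)  # bootstrap sample, with replacement
        in_bag = np.zeros(n, dtype=bool)
        in_bag[idx] = True
        oob = ~in_bag                     # points this learner never saw
        centroids = fit_centroids(X[idx], y[idx], n_classes)
        correct = predict_centroids(centroids, X[oob]) == y[oob]
        score_sum[oob] += correct
        oob_count[oob] += 1
    # Points that were never out-of-bag receive NaN instead of a value.
    return np.where(oob_count > 0, score_sum / np.maximum(oob_count, 1), np.nan)


# Synthetic demo: two well-separated Gaussian blobs with 10% flipped labels.
rng = np.random.default_rng(0)
n = 200
y_true = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + np.where(y_true[:, None] == 1, 2.0, -2.0)
flipped = rng.choice(n, size=20, replace=False)
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

values = oob_values(X, y_noisy)
is_flipped = np.zeros(n, dtype=bool)
is_flipped[flipped] = True
print("mean value, flipped labels:", np.nanmean(values[is_flipped]))
print("mean value, clean labels:  ", np.nanmean(values[~is_flipped]))
```

In this toy setting, points with flipped labels should receive markedly lower values than clean points, which is the kind of signal mislabeled-data detection relies on. The cost is one training run per weak learner, regardless of how many points are valued.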