A common way to evaluate a dataset in ML is to train a model on it and assess the model's performance on a test set. However, this approach has two issues: (1) it may incentivize undesirable data manipulation in data marketplaces, as self-interested data providers seek to modify their datasets to maximize their evaluation scores; (2) it may select datasets that overfit to potentially small test sets. We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data. Any manipulation of the data, including but not limited to data duplication, adding random data, data removal, or re-weighting data from different groups, cannot increase their expected score. Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) between the test dataset and the evaluated dataset. However, computing the PMI of two datasets is challenging. We introduce a novel method for computing PMI that greatly improves tractability in Bayesian machine learning settings. It rests on a new characterization of PMI that relies solely on the posterior probabilities of the model parameter at an arbitrarily selected value. Finally, we support our theoretical results with simulations and further test the effectiveness of our data valuation method in identifying the top datasets among multiple data providers. Interestingly, our method outperforms the standard approach of selecting datasets based on the trained model's test performance, suggesting that our truthful valuation score can also be more robust to overfitting.
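The abstract does not state the characterization explicitly, so the following is only an illustrative sketch of one identity that fits the description. It assumes the evaluated dataset D and the test dataset T are conditionally independent given the model parameter θ. Under that assumption, Bayes' rule gives, for any θ with positive prior density, PMI(D, T) = log p(θ|D) + log p(θ|T) - log p(θ) - log p(θ|D,T). The Python code below checks this numerically in a Beta-Bernoulli model, where both the posteriors and the marginal likelihoods have closed forms. The model choice, the function names, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math


def log_beta_fn(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_logpdf(theta, a, b):
    """Log density of Beta(a, b) at theta in (0, 1)."""
    return ((a - 1) * math.log(theta)
            + (b - 1) * math.log(1 - theta)
            - log_beta_fn(a, b))


def posterior_params(a, b, data):
    """Beta posterior parameters after observing a list of 0/1 outcomes."""
    heads = sum(data)
    return a + heads, b + len(data) - heads


def log_marginal(a, b, data):
    """Log marginal likelihood of an ordered 0/1 sequence under a Beta(a, b) prior."""
    pa, pb = posterior_params(a, b, data)
    return log_beta_fn(pa, pb) - log_beta_fn(a, b)


def pmi_direct(a, b, d, t):
    """PMI from marginal likelihoods: log p(D, T) - log p(D) - log p(T)."""
    return log_marginal(a, b, d + t) - log_marginal(a, b, d) - log_marginal(a, b, t)


def pmi_from_posteriors(a, b, d, t, theta):
    """PMI from prior and posterior densities evaluated at one arbitrary theta."""
    log_post_d = beta_logpdf(theta, *posterior_params(a, b, d))
    log_post_t = beta_logpdf(theta, *posterior_params(a, b, t))
    log_post_dt = beta_logpdf(theta, *posterior_params(a, b, d + t))
    log_prior = beta_logpdf(theta, a, b)
    return log_post_d + log_post_t - log_prior - log_post_dt


if __name__ == "__main__":
    a, b = 2.0, 2.0                 # hypothetical prior
    d = [1, 1, 0, 1]                # hypothetical evaluated dataset
    t = [1, 0, 1, 1, 1]             # hypothetical test dataset
    print("direct:", pmi_direct(a, b, d, t))
    for theta in (0.1, 0.5, 0.9):   # the choice of theta should not matter
        print("theta =", theta, "->", pmi_from_posteriors(a, b, d, t, theta))
```

In this conjugate setting the θ-dependent terms cancel exactly, which is why any evaluation point in (0, 1) returns the same value. In non-conjugate models the posteriors would have to be approximated, and the agreement would hold only up to that approximation error.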