Traditionally, data valuation (DV) is posed as a problem of equitably splitting the validation performance of a learning algorithm among the training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. This dependence is undesirable for many DV use cases, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data must be valued before the actual analysis, when the learning algorithm has not yet been chosen. Another side effect of the dependence is that, to assess the value of an individual point, one needs to re-run the learning algorithm with and without that point, which incurs a heavy computational burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that values training data in a way that is oblivious to the downstream learning algorithm. Our main results are as follows. (1) We develop a proxy for the validation performance associated with a training set, based on a non-conventional class-wise Wasserstein distance between the training and validation sets. We show that, under certain Lipschitz conditions, this distance characterizes an upper bound on the validation performance of any given model. (2) We develop a novel method to value individual data points based on sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be obtained for free from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over state-of-the-art performance while being orders of magnitude faster.
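To make the sensitivity idea in result (2) more concrete, the following is a sketch of the standard discrete optimal transport problem and its dual. It is not the paper's exact formulation. All symbols are assumptions introduced for illustration: \mu_i and \nu_j are the probability masses on the N training and M validation points, C_{ij} is a ground cost between training point i and validation point j (the class-wise cost that combines feature and label information is abstracted into C_{ij}, since the abstract does not specify it), and f, g are the dual variables.

```latex
% Primal: discrete optimal transport between training weights \mu
% and validation weights \nu under an assumed ground cost C.
\mathrm{OT}(\mu,\nu) \;=\; \min_{\pi \ge 0} \; \sum_{i=1}^{N}\sum_{j=1}^{M} C_{ij}\,\pi_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{M} \pi_{ij} = \mu_i \;\; \forall i, \qquad
\sum_{i=1}^{N} \pi_{ij} = \nu_j \;\; \forall j .

% Dual: one variable per marginal constraint.
\mathrm{OT}(\mu,\nu) \;=\; \max_{f,\,g} \; \sum_{i=1}^{N} f_i\,\mu_i + \sum_{j=1}^{M} g_j\,\nu_j
\quad \text{s.t.} \quad
f_i + g_j \le C_{ij} \;\; \forall i,j .

% Sensitivity: by linear-programming duality, an optimal dual variable f_i^*
% measures how the optimal cost responds to the mass on training point i.
% Because (f^* + c, \, g^* - c) is also optimal for any constant c,
% only differences of dual values are meaningful:
\mathrm{OT}(\mu + \varepsilon(e_i - e_k),\, \nu) - \mathrm{OT}(\mu,\nu)
\;\approx\; \varepsilon\,\bigl(f_i^{*} - f_k^{*}\bigr)
\quad \text{for small } \varepsilon > 0 .
```

Under this reading, an LP solver that returns the optimal transport cost also returns the dual values f^*, so per-point sensitivities come at no extra cost. A training point whose dual value is large relative to the others would increase the distance when given more mass, which makes it a candidate low-quality point. The first-order approximation assumes the optimal dual solution is unique up to the constant shift, and the normalization applied to f^* before use as a data value is not stated in the abstract, so any particular choice here would be an assumption.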