Traditionally, data valuation is posed as the problem of equitably splitting the validation performance of a learning algorithm among its training data. As a result, the calculated data values depend on many design choices of the underlying learning algorithm. This dependence is undesirable for many use cases of data valuation, such as setting priorities over different data sources in a data acquisition process and informing pricing mechanisms in a data marketplace. In these scenarios, data must be valued before the actual analysis, at a point when the learning algorithm has not yet been chosen. Another side effect of the dependence is that assessing the value of an individual point requires re-running the learning algorithm with and without that point, which incurs a large computational burden. This work leapfrogs over the current limits of data valuation methods by introducing a new framework that can value training data in a way that is oblivious to the downstream learning algorithm. (1) We develop a proxy for the validation performance associated with a training set, based on a non-conventional class-wise Wasserstein distance between the training set and the validation set. We show that, under certain Lipschitz conditions, this distance characterizes an upper bound on the validation performance of any given model. (2) We develop a novel method to value individual data points based on a sensitivity analysis of the class-wise Wasserstein distance. Importantly, these values can be obtained for free, directly from the output of off-the-shelf optimization solvers when computing the distance. (3) We evaluate our new data valuation framework over various use cases related to detecting low-quality data and show that, surprisingly, the learning-agnostic feature of our framework enables a significant improvement over the state-of-the-art performance while being orders of magnitude faster.
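To make the idea in (1) and (2) concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simplified ground cost (Euclidean feature distance plus a fixed `label_penalty` when labels differ) in place of the class-wise cost, an entropy-regularized Sinkhorn solver in place of an exact solver, and a "calibrated gradient" that compares each training point's dual potential with the average of the others. All function names, parameters, and the toy data are hypothetical. Under this sketch's sign convention, a larger value means that adding mass to the point would increase the distance to the validation set, which flags it as likely low quality.

```python
import math


def _logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a list of floats.
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def pair_cost(a, b, label_penalty):
    # ASSUMPTION: simplified stand-in for a class-wise ground cost:
    # feature distance plus a constant penalty when the labels differ.
    (xa, ya), (xb, yb) = a, b
    return math.dist(xa, xb) + (label_penalty if ya != yb else 0.0)


def sinkhorn_duals(train, val, label_penalty=2.0, eps=0.1, iters=300):
    # Log-domain Sinkhorn iterations between uniform measures on the
    # training and validation sets. Returns the entropic dual objective
    # (a regularized distance estimate) and the training-side potentials.
    n, m = len(train), len(val)
    cost = [[pair_cost(t, v, label_penalty) for v in val] for t in train]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(iters):
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps
                                for j in range(m)]) for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps
                                for i in range(n)]) for j in range(m)]
    distance = sum(f) / n + sum(g) / m
    return distance, f


def calibrated_values(f):
    # ASSUMPTION: sensitivity of the distance to the mass on point i,
    # taken as f_i minus the mean potential of the remaining points.
    # This removes the additive constant that dual potentials carry.
    n, total = len(f), sum(f)
    return [fi - (total - fi) / (n - 1) for fi in f]


# Hypothetical toy data: (features, label). The last training point has
# class-0-like features but carries label 1, i.e. it is mislabeled.
train = [((0.0, 0.0), 0), ((0.1, 0.0), 0),
         ((1.0, 1.0), 1), ((0.9, 1.0), 1),
         ((0.05, 0.1), 1)]
val = [((0.0, 0.1), 0), ((0.1, 0.1), 0),
       ((1.0, 0.9), 1), ((0.9, 0.9), 1)]

distance, potentials = sinkhorn_duals(train, val)
values = calibrated_values(potentials)
worst = max(range(len(values)), key=values.__getitem__)
print("most suspicious training index:", worst)
```

No model is trained anywhere in this sketch: the values come from the dual potentials that the solver produces while computing the distance, which is the sense in which they are obtained at no extra cost.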