Data constitute the foundational component of the data economy and its marketplaces. Efficient and fair data valuation has emerged as a topic of significant interest. Many approaches based on marginal contribution have shown promising results in various downstream tasks. However, they are well known to be computationally expensive because they require training a large number of utility functions, which are used to evaluate the usefulness or value of a given dataset for a specific purpose. As a result, applying these methods to a data marketplace involving large-scale datasets has been recognized as infeasible. A critical question therefore arises: how can re-training of the utility function be avoided? To address this question, we propose a novel data valuation method from the perspective of optimal control, named neural dynamic data valuation (NDDV). Our method has a solid theoretical interpretation and accurately identifies data values via the sensitivity of the optimal control state of the data. In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states. Notably, our method requires training only once to estimate the value of all data points, which significantly improves computational efficiency. We conduct comprehensive experiments using different datasets and tasks. The results demonstrate that the proposed NDDV method outperforms existing state-of-the-art data valuation methods in accurately identifying data points with either high or low values, and that it is more computationally efficient.
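To make the re-training cost of marginal-contribution methods concrete, the following is a minimal sketch of leave-one-out valuation, one of the simplest members of that family. It is not the NDDV method. The toy 1-D nearest-centroid classifier, the data, and all function names are assumptions chosen purely for illustration. Each call to `utility` retrains the model, so valuing n points costs n + 1 fits.

```python
from statistics import mean


def train_centroids(points):
    """Fit a toy 1-D nearest-centroid classifier: one mean per class label."""
    by_label = {}
    for x, y in points:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def utility(train, valid, counter):
    """Retrain on `train` and return validation accuracy (the utility)."""
    counter["fits"] += 1  # every utility evaluation is a full re-training
    centroids = train_centroids(train)
    if not centroids:
        return 0.0
    correct = sum(
        min(centroids, key=lambda c: abs(x - centroids[c])) == y
        for x, y in valid
    )
    return correct / len(valid)


def leave_one_out_values(train, valid):
    """Value of point i = utility(all data) - utility(all data except i)."""
    counter = {"fits": 0}
    full = utility(train, valid, counter)
    values = [
        full - utility(train[:i] + train[i + 1:], valid, counter)
        for i in range(len(train))
    ]
    return values, counter["fits"]


# Hypothetical toy data: (feature, label); the last point is mislabeled.
train = [(0.0, 0), (0.2, 0), (1.0, 1), (1.2, 1), (-2.0, 1)]
valid = [(0.1, 0), (0.3, 0), (0.9, 1), (1.1, 1)]

values, fits = leave_one_out_values(train, valid)
print("values:", values)
print("model fits:", fits)
```

In this toy setting the mislabeled point receives the lowest value, and the number of fits grows linearly with the dataset size. Shapley-style estimators typically need many more utility evaluations than this, which is the cost a train-once method aims to avoid.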