We consider the dataset valuation problem: quantifying the incremental gain, with respect to a pre-defined utility of a machine learning task, obtained by aggregating an individual dataset with others. The Shapley value is a natural tool for dataset valuation because of its formal axiomatic justification, and it can be combined with Monte Carlo integration to overcome computational tractability challenges. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit knowledge of the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation, referred to as discrete uniform Shapley, which is expressed as an expectation under a discrete uniform distribution with support of reasonable size. We justify the relevance of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and illustrate its benefits via an extensive set of numerical experiments.
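For orientation, the generic Monte Carlo baseline mentioned above can be sketched with standard permutation sampling. This is not the discrete uniform Shapley estimator proposed in the paper; it only illustrates the kind of generic approximation the paper aims to improve on. The function name `monte_carlo_shapley`, the dataset sizes in `SIZES`, and the square-root `toy_utility` are all hypothetical choices made for illustration.

```python
import math
import random
from typing import Callable, FrozenSet, List

# Hypothetical dataset sizes; each index plays the role of one dataset.
SIZES = [100, 400, 900]


def toy_utility(coalition: FrozenSet[int]) -> float:
    """Assumed toy utility with diminishing returns in the pooled sample size."""
    return math.sqrt(sum(SIZES[i] for i in coalition))


def monte_carlo_shapley(
    n_players: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    players = list(range(n_players))
    totals = [0.0] * n_players
    for _ in range(n_permutations):
        rng.shuffle(players)  # draw one random ordering of the datasets
        coalition = set()
        previous = utility(frozenset(coalition))
        for p in players:
            coalition.add(p)
            current = utility(frozenset(coalition))
            totals[p] += current - previous  # marginal gain of adding dataset p
            previous = current
    return [t / n_permutations for t in totals]


if __name__ == "__main__":
    estimates = monte_carlo_shapley(len(SIZES), toy_utility, n_permutations=2000)
    print(estimates)
```

Each sampled ordering requires one utility evaluation per dataset. In a real valuation task every evaluation typically means retraining or re-evaluating a model, which is the cost that motivates more structured estimators.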