Many machine learning problems require dataset valuation, i.e., quantifying the incremental gain, with respect to some relevant predefined utility, of aggregating an individual dataset with others. As seminal examples, dataset valuation has been leveraged in collaborative and federated learning to create incentives for data sharing across several data owners. The Shapley value has recently been proposed as a principled tool to achieve this goal owing to its formal axiomatic justification. Since computing it often requires exponential time, standard approximation strategies based on Monte Carlo integration have been considered. Such generic approximation methods, however, remain expensive in some cases. In this paper, we exploit knowledge of the structure of the dataset valuation problem to devise more efficient Shapley value estimators. We propose a novel approximation of the Shapley value, referred to as discrete uniform Shapley (DU-Shapley), which is expressed as an expectation under a discrete uniform distribution with a support of reasonable size. We justify the relevance of the proposed framework via asymptotic and non-asymptotic theoretical guarantees and show that DU-Shapley tends towards the Shapley value when the number of data owners is large. The benefits of the proposed framework are finally illustrated on several dataset valuation benchmarks, where DU-Shapley outperforms other Shapley value approximations, even when the number of data owners is small.
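For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of the exact Shapley value and of the generic Monte Carlo (permutation-sampling) approximation. It is not the DU-Shapley estimator, which the abstract does not specify. The function names, the list `SIZES`, and the toy utility (log of the pooled number of data points) are hypothetical choices made only for illustration.

```python
import itertools
import math
import random


def exact_shapley(n_players, utility):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    players = range(n_players)
    values = [0.0] * n_players
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight of any coalition of this size that excludes i.
            weight = (math.factorial(size)
                      * math.factorial(n_players - size - 1)
                      / math.factorial(n_players))
            for coalition in itertools.combinations(others, size):
                s = frozenset(coalition)
                values[i] += weight * (utility(s | {i}) - utility(s))
    return values


def monte_carlo_shapley(n_players, utility, n_permutations=2000, seed=0):
    """Permutation sampling: average marginal gains over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * n_players
    order = list(range(n_players))
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = frozenset()
        previous = utility(coalition)
        for i in order:
            coalition = coalition | {i}
            current = utility(coalition)
            values[i] += (current - previous) / n_permutations
            previous = current
    return values


# Hypothetical toy setting (an assumption, not taken from the paper):
# owner i holds SIZES[i] data points and the utility depends only on the
# total number of pooled points, with diminishing returns.
SIZES = [10, 50, 200, 1000]


def toy_utility(coalition):
    return math.log1p(sum(SIZES[i] for i in coalition))


exact = exact_shapley(len(SIZES), toy_utility)
approx = monte_carlo_shapley(len(SIZES), toy_utility)
```

The exact routine evaluates the utility on every coalition, which is what makes it impractical beyond a handful of data owners. The sampling routine trades that cost for variance, which is the expense that structure-aware estimators such as the one proposed here aim to reduce.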