Data valuation is an essential task in a data marketplace. It aims to compensate data owners fairly for their contributions. There is increasing recognition in the machine learning community that the Shapley value -- a foundational profit-sharing scheme in cooperative game theory -- has major potential for valuing data, because it uniquely satisfies basic properties for fair credit allocation and has been shown to identify data sources that are useful or harmful to model performance. However, calculating the Shapley value requires access to the original data sources. It remains an open question how to design a real-world data marketplace that takes advantage of Shapley value-based data pricing while protecting privacy and allowing fair payments. In this paper, we propose the {\em first} prototype of a data marketplace that values data sources based on the Shapley value in a privacy-preserving manner and at the same time ensures fair payments. Our approach is enabled by a suite of innovations in both algorithm and system design. We first propose a Shapley value calculation algorithm that can be efficiently implemented via multiparty computation (MPC) circuits. The key idea is to learn a performance predictor that directly predicts the model performance corresponding to an input dataset without performing actual training. We then optimize the MPC circuit design based on the structure of the performance predictor. Finally, we incorporate fair payment into the MPC circuit to guarantee that the data the buyer pays for is exactly the data that has been valued. Our experimental results show that the proposed data valuation algorithm is as effective as the original, more expensive one. Furthermore, the customized MPC protocol is efficient and scalable.
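To make the valuation step concrete, the following is a minimal sketch of exact Shapley value computation over a handful of data sources, written in plain Python with no MPC involved. It is an illustration under stated assumptions, not the paper's algorithm: `shapley_values` enumerates every coalition, which is exponential in the number of sources, and `toy_predictor` is a hypothetical stand-in for the learned performance predictor, with made-up per-source quality numbers.

```python
from itertools import combinations
from math import factorial


def shapley_values(sources, utility):
    """Exact Shapley value of each source under `utility`.

    `utility` maps a frozenset of sources to a performance score.
    Runs in exponential time, so it is only usable for a few sources.
    """
    n = len(sources)
    values = {}
    for s in sources:
        others = [o for o in sources if o != s]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that exclude s: k!(n-k-1)!/n!
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal contribution of s to this coalition.
                gain = utility(coalition | {s}) - utility(coalition)
                total += weight * gain
        values[s] = total
    return values


# Hypothetical stand-in for a learned performance predictor (assumption):
# it returns a score for a set of sources without training any model.
def toy_predictor(coalition):
    quality = {"A": 0.30, "B": 0.20, "C": -0.05}  # invented numbers
    return 0.5 + sum(quality[s] for s in coalition)


if __name__ == "__main__":
    for source, value in shapley_values(["A", "B", "C"], toy_predictor).items():
        print(source, round(value, 4))
```

Because the toy predictor is additive, each source's Shapley value equals its own quality term, so the harmful source `C` receives a negative value. The values also sum to the gap between the score of the full set and that of the empty set, which is the efficiency property underlying fair credit allocation.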