Scalable Data Point Valuation in Decentralized Learning

Existing research on data valuation in federated and swarm learning focuses on valuing client contributions and works best when data across clients is independent and identically distributed (IID). In practice, data is rarely IID. We develop an approach called DDVal for decentralized data valuation, capable of valuing individual data points in federated and swarm learning. DDVal is based on sharing deep features and approximating Shapley values through a k-nearest neighbor approximation method. This enables novel applications, for example, simultaneously rewarding institutions and individuals for providing data to a decentralized machine learning task. Valuing data points through DDVal also allows hierarchical conclusions to be drawn about the contribution of institutions, and we empirically show that DDVal estimates institutional contributions more accurately than existing Shapley value approximation methods for federated learning. Specifically, it reaches a cosine similarity of 99.969% in approximating Shapley values under both IID and non-IID data distributions across institutions, compared with 99.301% and 97.250% for the best state-of-the-art methods. DDVal scales with the number of data points instead of the number of clients and has log-linear complexity, which is more favorable than the exponential complexity of existing approaches. We show that DDVal is especially efficient when many clients each hold few data points, for example, more than 16 clients with 8,000 data points each. By integrating DDVal into a decentralized system, we show that it is suitable not only for centralized federated learning but also for decentralized swarm learning, which aligns well with research on emerging internet technologies such as Web3 that reward users for providing data to algorithms.
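To make the k-nearest-neighbor idea concrete, the following minimal sketch computes per-point Shapley values for an unweighted k-NN classifier using the exact sorting-based recursion known from the data valuation literature, then sums them per client to obtain institution-level contributions. It is an illustration under stated assumptions, not DDVal itself: whether DDVal uses exactly this recursion is an assumption, the feature arrays merely stand in for shared deep features, and the names knn_shapley, client_values, and cosine_similarity are hypothetical.

```python
import numpy as np

def knn_shapley(train_x, train_y, test_x, test_y, k=5):
    """Per-point Shapley values for a k-NN classifier, averaged over test points.

    Assumption: train_x / test_x are feature vectors (standing in for deep features).
    """
    n = len(train_x)
    values = np.zeros(n)
    for x, y in zip(test_x, test_y):
        # Sort training points from nearest to farthest; this sort is the
        # O(n log n) step that dominates the cost per test point.
        dist = np.linalg.norm(train_x - x, axis=1)
        order = np.argsort(dist, kind="stable")
        match = (train_y[order] == y).astype(float)
        s = np.zeros(n)
        # The farthest point gets its label agreement divided by n.
        s[n - 1] = match[n - 1] / n
        # Walk inward from the second-farthest point to the nearest one.
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        values[order] += s
    return values / len(test_x)

def client_values(point_values, client_ids):
    """Sum point values per client (client_ids are non-negative integers)."""
    return np.bincount(client_ids, weights=point_values)

def cosine_similarity(a, b):
    """Cosine similarity between two contribution vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because the cost is driven by sorting the data points rather than by enumerating client coalitions, a sketch like this grows with the number of points and not exponentially with the number of clients. The cosine_similarity helper shows how an estimated contribution vector could be compared against reference Shapley values.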
