Obtaining high-quality data for collaborative training of machine learning models can be challenging due to (A) regulatory concerns and (B) a lack of incentive to participate. The first issue can be addressed through privacy-enhancing technologies (PETs), one of the most frequently used being differentially private (DP) training. The second can be addressed by identifying which data points are beneficial for model training and rewarding data owners for sharing them. However, DP in deep learning typically has an adverse effect on atypical (often informative) data samples, making it difficult to assess the usefulness of individual contributions. In this work, we investigate how to leverage gradient information to identify training samples of interest in private training settings. We show that there exist techniques that give clients the tools for principled data selection even in the strictest privacy settings.