Obtaining high-quality data for collaborative training of machine learning models can be challenging due to (A) regulatory concerns and (B) a lack of incentives to participate. The first issue can be addressed through privacy-enhancing technologies (PETs), one of the most frequently used being differentially private (DP) training. The second can be addressed by identifying which data points are beneficial for model training and rewarding data owners for sharing them. However, DP in deep learning typically has an adverse effect on atypical (and often informative) data samples, which makes it difficult to assess the usefulness of individual contributions. In this work, we investigate how to leverage gradient information to identify training samples of interest in private training settings. We show that there exist techniques that can provide clients with tools for principled data selection even in the strictest privacy settings.
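As a rough illustration of what "gradient information" can mean in a DP training setting, the sketch below computes per-sample gradients for a toy logistic-regression model, applies a DP-SGD-style update (per-sample clipping plus Gaussian noise), and ranks samples by the norm of their unclipped gradient. This is not the method evaluated in this work. The model, the function names (`per_sample_gradient`, `clip`, `dp_sgd_step`, `gradient_norm_scores`), the toy data, and the choice of gradient norm as a score are all assumptions made for illustration. The scores are assumed to be computed locally by the data owner on their own data, so they are not themselves covered by the DP guarantee.

```python
import math
import random

def per_sample_gradient(w, x, y):
    """Gradient of the logistic loss for one sample (x, y), with y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def clip(g, max_norm):
    """Rescale g so that its L2 norm is at most max_norm."""
    norm = l2_norm(g)
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [scale * gi for gi in g]

def dp_sgd_step(w, X, Y, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD-style step: clip each per-sample gradient, sum,
    add Gaussian noise scaled to the clipping bound, then average."""
    total = [0.0] * len(w)
    for x, y in zip(X, Y):
        g = clip(per_sample_gradient(w, x, y), max_norm)
        total = [t + gi for t, gi in zip(total, g)]
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [wi - lr * ni / len(X) for wi, ni in zip(w, noisy)]

def gradient_norm_scores(w, X, Y):
    """Hypothetical usefulness score: L2 norm of each unclipped per-sample gradient."""
    return [l2_norm(per_sample_gradient(w, x, y)) for x, y in zip(X, Y)]

# Toy data (illustrative only): the last sample is atypical for its label.
X = [[1.0, 0.5], [0.8, 0.7], [-1.0, -0.6], [-0.9, -0.4], [-3.0, -2.5]]
Y = [1, 1, 0, 0, 1]

rng = random.Random(0)
w = [0.0, 0.0]
for _ in range(50):
    w = dp_sgd_step(w, X, Y, lr=0.5, max_norm=1.0, noise_multiplier=1.0, rng=rng)

scores = gradient_norm_scores(w, X, Y)
ranking = sorted(range(len(X)), key=lambda i: scores[i], reverse=True)
print(ranking)  # sample indices, largest gradient norm first
```

Clipping bounds each sample's influence on the update, which is also why large-gradient (often atypical) samples contribute less under DP than they would in non-private training.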