Protecting Sensitive Data through Federated Co-Training

In many critical applications, sensitive data is inherently distributed. Federated learning trains a model collaboratively by aggregating the parameters of locally trained models, which avoids exposing the sensitive local data directly. It is still possible, though, to infer sensitive data from the shared model parameters. At the same time, many types of machine learning models, such as decision trees and rule ensembles, do not lend themselves to parameter aggregation. It has been observed that in many applications, in particular healthcare, large unlabeled datasets are publicly available. They can be used to exchange information between clients by distributed distillation, i.e., co-regularizing local training via the discrepancy between the soft predictions of the clients on the unlabeled dataset. This, however, still discloses private information and restricts the types of models to those trainable via gradient-based methods. We propose to go one step further and use a form of federated co-training, where local hard labels on the public unlabeled dataset are shared and aggregated into a consensus label. This consensus label can be used for local training by any supervised machine learning model. We show that this federated co-training approach achieves a model quality comparable to both federated learning and distributed distillation on a set of benchmark datasets and real-world medical datasets. It improves privacy over both approaches, protecting against common membership inference attacks to the highest degree. Furthermore, we show that federated co-training can collaboratively train interpretable models, such as decision trees and rule ensembles, achieving a model quality comparable to centralized training.
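The aggregation step described above can be illustrated with a minimal sketch. The abstract does not specify how hard labels are combined into a consensus, so the majority vote used here is an assumption, as are the function name `consensus_labels`, the tie-breaking rule, and the toy label matrix.

```python
import numpy as np


def consensus_labels(local_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Aggregate hard labels by majority vote (an assumed aggregation rule).

    local_labels has shape (num_clients, num_public_samples); entry [i, j] is
    the class that client i predicts for public sample j.
    """
    # Count the votes each class receives for every public sample.
    votes = np.stack(
        [(local_labels == c).sum(axis=0) for c in range(num_classes)], axis=1
    )
    # argmax returns the first maximum, so ties go to the smallest class index.
    return votes.argmax(axis=1)


# Hypothetical hard labels from three clients on five public unlabeled samples.
local_labels = np.array([
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 2],
    [1, 1, 2, 0, 2],
])

consensus = consensus_labels(local_labels, num_classes=3)
print(consensus)
```

Only the hard labels on the public dataset leave a client in this sketch; no parameters or soft predictions are exchanged. Each client could then add the public samples with their consensus labels to its local training set and fit any supervised model, including a decision tree or a rule ensemble.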

