In many critical applications, sensitive data is inherently distributed. Federated learning trains a model collaboratively by aggregating the parameters of locally trained models, which avoids exposing sensitive local data directly. It is still possible, however, to infer sensitive data from the shared model parameters. At the same time, many types of machine learning models, such as decision trees or rule ensembles, do not lend themselves to parameter aggregation. It has been observed that in many applications, in particular healthcare, large unlabeled datasets are publicly available. These can be used to exchange information between clients through distributed distillation, i.e., co-regularizing local training via the discrepancy between the soft predictions of the local clients on the unlabeled dataset. This, however, still discloses private information and restricts the types of models to those trainable via gradient-based methods. We propose to go one step further and use a form of federated co-training, in which the clients share their local hard labels on the public unlabeled dataset and these labels are aggregated into a consensus label. This consensus label can be used for local training by any supervised machine learning model. We show that this federated co-training approach achieves a model quality comparable to both federated learning and distributed distillation on a set of benchmark datasets and real-world medical datasets. It improves privacy over both approaches, providing the strongest protection against common membership inference attacks. Furthermore, we show that federated co-training can collaboratively train interpretable models, such as decision trees and rule ensembles, achieving a model quality comparable to centralized training.
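To illustrate the aggregation step, the following is a minimal sketch of how hard labels from several clients on a shared public unlabeled dataset could be combined into a consensus label. It assumes a plain majority vote as the aggregation rule, which the abstract does not specify; the function name `consensus_labels`, the array layout, and the example votes are hypothetical and chosen only for illustration.

```python
import numpy as np


def consensus_labels(local_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Aggregate hard labels by majority vote (an assumed aggregation rule).

    local_labels has shape (num_clients, num_public_samples) and holds
    integer class ids predicted by each client on the public dataset.
    """
    # counts[c, j] = number of clients that predicted class c for sample j
    counts = np.stack(
        [(local_labels == c).sum(axis=0) for c in range(num_classes)]
    )
    # argmax returns the first maximum, so ties go to the smallest class id
    return counts.argmax(axis=0)


# Hypothetical hard labels from three clients on four public samples
votes = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
])
consensus = consensus_labels(votes, num_classes=3)
print(consensus)  # [0 1 2 1]
```

Each client could then fit any supervised model, for example a decision tree, on its private data together with the public samples paired with `consensus`. Because only class ids are exchanged, this step requires no gradients or model parameters.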