Protecting data privacy is paramount in fields such as finance, banking, and healthcare. Federated Learning (FL) has attracted widespread attention because it trains models in a decentralized, distributed manner and protects privacy while still producing a globally shared model. However, FL poses challenges such as communication overhead and limited resource capability. This motivated us to propose a two-stage federated learning approach aimed at privacy protection, which is a first-of-its-kind study: (i) in the first stage, a synthetic dataset is generated by using two different distributions as the noise input to the vanilla conditional tabular generative adversarial network (CTGAN), resulting in a modified CTGAN; and (ii) in the second stage, a Federated Probabilistic Neural Network (FedPNN) is developed and employed to build a globally shared classification model. We also employ synthetic dataset metrics to check the quality of the generated synthetic dataset. Further, we propose a meta-clustering algorithm in which the cluster centers obtained from the clients are clustered again at the server, and the resulting centers are used to train the global model. Although PNN is a one-pass learning classifier, its complexity depends on the training data size. Therefore, we employ a modified evolving clustering method (ECM), another one-pass algorithm, to cluster the training data, thereby increasing the speed further. Moreover, we conduct a sensitivity analysis by varying Dthr, a hyperparameter of ECM, at the server and at the client, one at a time. The effectiveness of our approach is validated on four finance and medical datasets.
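The second-stage pipeline described above (client-side ECM clustering, server-side meta-clustering of the client centers, and PNN classification on the resulting centers) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a simplified form of the standard ECM rather than the modified variant, it assumes clustering is done separately per class so that each center keeps a label, and it uses a plain Gaussian-kernel PNN. All function names (`ecm`, `cluster_per_class`, `pnn_predict`), the toy data, and the values of `dthr` and `sigma` are hypothetical choices made for illustration only.

```python
import numpy as np

def ecm(points, dthr):
    """Simplified one-pass Evolving Clustering Method; returns cluster centers."""
    centers = [np.asarray(points[0], dtype=float)]
    radii = [0.0]
    for x in points[1:]:
        x = np.asarray(x, dtype=float)
        dists = np.array([np.linalg.norm(x - c) for c in centers])
        r = np.array(radii)
        if np.any(dists <= r):
            continue  # x already lies inside an existing cluster
        s = dists + r
        j = int(np.argmin(s))
        if s[j] > 2.0 * dthr:
            # too far from every cluster: start a new one at x
            centers.append(x.copy())
            radii.append(0.0)
        else:
            # enlarge cluster j and move its center toward x so that
            # x sits exactly on the new cluster boundary
            new_r = s[j] / 2.0
            centers[j] = x + (centers[j] - x) * (new_r / dists[j])
            radii[j] = new_r
    return np.vstack(centers)

def cluster_per_class(X, y, dthr):
    """Run ECM separately for each class (an assumption of this sketch)."""
    centers, labels = [], []
    for label in np.unique(y):
        c = ecm(X[y == label], dthr)
        centers.append(c)
        labels.append(np.full(len(c), label))
    return np.vstack(centers), np.concatenate(labels)

def pnn_predict(centers, labels, X, sigma=0.5):
    """PNN with Gaussian kernels: pick the class with the largest mean kernel response."""
    classes = np.unique(labels)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    scores = np.stack([k[:, labels == c].mean(axis=1) for c in classes], axis=1)
    return classes[np.argmax(scores, axis=1)]

# Toy data: three clients, each holding two well-separated classes.
rng = np.random.default_rng(0)

def make_client(n):
    X0 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(n, 2))
    X1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

clients = [make_client(50) for _ in range(3)]

# Client side: each client shares only its cluster centers, not raw rows.
local = [cluster_per_class(X, y, dthr=0.5) for X, y in clients]
all_centers = np.vstack([c for c, _ in local])
all_labels = np.concatenate([l for _, l in local])

# Server side: meta-clustering of the received centers, then PNN on the result.
global_centers, global_labels = cluster_per_class(all_centers, all_labels, dthr=0.8)
queries = np.array([[0.1, -0.2], [2.9, 3.1]])
pred = pnn_predict(global_centers, global_labels, queries)
print(len(all_centers), "client centers ->", len(global_centers), "global centers")
print("predictions:", pred)
```

In this sketch the PNN pattern layer holds one unit per global center rather than one per training sample, which is why the number of centers (controlled by Dthr at the client and at the server) directly affects prediction cost.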