Data is the main fuel of a successful machine learning model. A dataset may contain sensitive individual records, e.g., personal health records, financial data, or industrial information. Training a model on such sensitive data raises a privacy concern when the training is done on third-party cloud computing. Trained models are also subject to privacy attacks, which can leak sensitive information about the training data. This study aims to preserve the privacy of training data in the context of customer churn prediction modeling for the telecommunications industry (TCI). We propose a framework for a privacy-preserving customer churn prediction (PPCCP) model in the cloud environment. We introduce a novel approach that combines Generative Adversarial Networks (GANs) with adaptive Weight-of-Evidence (aWOE): synthetic data is generated by GANs, and aWOE is applied to the synthetic training dataset before the data is fed to the classification algorithms. Experiments were carried out using eight machine learning (ML) classifiers on three openly accessible datasets from the telecommunications sector, and performance was evaluated using six commonly employed evaluation metrics. In addition to a data privacy analysis, we performed a statistical significance test. The training and prediction processes achieve data privacy, and the prediction classifiers achieve high prediction performance (87.1\% in terms of F-measure for the GANs-aWOE based Na\"{\i}ve Bayes model). Compared with earlier studies, the proposed approach improves prediction by up to 28.9\% in accuracy and 27.9\% in F-measure.
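To make the Weight-of-Evidence step concrete, the following is a minimal sketch of the standard (non-adaptive) WOE encoding of one categorical feature. It is not the aWOE variant, whose adaptive rule is not described in this abstract. The function name `woe_table`, the `smoothing` parameter, the sign convention (log of churner share over non-churner share), and the toy records are all assumptions made for illustration; in the pipeline described above, the table would presumably be fitted on GAN-generated synthetic records rather than on a hand-written list.

```python
import math
from collections import defaultdict


def woe_table(values, labels, smoothing=0.5):
    """Map each category to ln(P(category | churn) / P(category | no churn)).

    Hypothetical helper: plain WOE with additive smoothing, not the paper's aWOE.
    """
    pos = defaultdict(float)  # counts among churners (label 1)
    neg = defaultdict(float)  # counts among non-churners (label 0)
    for v, y in zip(values, labels):
        if y == 1:
            pos[v] += 1
        else:
            neg[v] += 1

    categories = set(pos) | set(neg)
    k = len(categories)
    # Smoothed class totals, so that no category produces log(0).
    total_pos = sum(pos.values()) + smoothing * k
    total_neg = sum(neg.values()) + smoothing * k

    return {
        c: math.log(((pos[c] + smoothing) / total_pos)
                    / ((neg[c] + smoothing) / total_neg))
        for c in categories
    }


# Toy stand-in for synthetic training records (illustrative values only).
contract = ["monthly", "monthly", "monthly", "yearly", "yearly", "yearly"]
churned = [1, 1, 0, 0, 0, 1]

table = woe_table(contract, churned)
encoded = [table[v] for v in contract]  # numeric feature passed to a classifier
```

Under this sign convention, a positive WOE value marks a category that is over-represented among churners and a negative value marks one that is under-represented, so the encoded column carries the class-conditional signal without exposing the raw category labels to the classifier.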