As cloud-based services are increasingly used to train and deploy machine learning models, data privacy has become a major concern. The concern is especially acute for natural language processing (NLP) models, which often process sensitive information such as personal communications and confidential documents. In this study, we propose a method for training NLP models on encrypted text data that mitigates privacy concerns while maintaining performance similar to that of models trained on non-encrypted data. We demonstrate the method with two architectures, Doc2Vec+XGBoost and Doc2Vec+LSTM, and evaluate the models on the 20 Newsgroups dataset. Our results indicate that models trained on encrypted data perform comparably to models trained on non-encrypted data, suggesting that the encryption method preserves data privacy without sacrificing model accuracy. To support replication of our experiments, we provide a Colab notebook at https://t.ly/lR-TP
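The abstract does not specify the encryption scheme, so the following is only an illustrative sketch of why a model such as Doc2Vec could, in principle, learn from transformed text. It assumes a deterministic, keyed, token-level mapping: each word is replaced by a truncated HMAC-SHA256 digest, which is a one-way pseudonymization rather than a reversible cipher and is not necessarily what the paper uses. The function names, the whitespace tokenizer, and the 16-character truncation are hypothetical choices made for this example.

```python
import hashlib
import hmac
from typing import List


def encode_token(token: str, key: bytes, length: int = 16) -> str:
    """Map a token to a deterministic keyed pseudonym.

    Assumption: HMAC-SHA256 truncated to `length` hex characters stands in
    for the paper's (unspecified) encryption method.
    """
    digest = hmac.new(key, token.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def encode_document(text: str, key: bytes) -> List[str]:
    """Lowercase, split on whitespace, and encode each token independently."""
    return [encode_token(tok, key) for tok in text.lower().split()]


if __name__ == "__main__":
    secret_key = b"example-key-kept-by-the-data-owner"  # hypothetical key
    encoded = encode_document("the cat saw the dog", secret_key)
    # Identical plaintext tokens share one pseudonym, so word co-occurrence
    # statistics survive even though the words themselves are hidden.
    print(encoded[0] == encoded[3])
    print(len(set(encoded)))
```

Because the mapping is deterministic under a fixed key, the encoded token lists keep the same co-occurrence structure as the plaintext, and could be passed to an embedding model in place of the original words (for example, as the `words` of a gensim `TaggedDocument`). Deterministic mappings of this kind remain open to frequency analysis, so the sketch should not be read as a statement about the security of the paper's method.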