Neural machine translation (NMT) is a widely used text generation task, yet despite significant data privacy concerns for NMT systems, there is a considerable research gap in the development of privacy-preserving NMT models. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees. However, existing work does not always specify the implementation details of training with DP-SGD: different software libraries are used and code bases are not always public, which leads to reproducibility issues. To address this, we introduce DP-NMT, an open-source framework for research on privacy-preserving NMT with DP-SGD that brings together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems while keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
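For readers unfamiliar with DP-SGD, the core update can be sketched as follows: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. This is a minimal illustration in plain Python, not the DP-NMT implementation; the names `clip_by_l2_norm`, `dp_sgd_step`, `max_norm`, and `noise_multiplier` are hypothetical, per-example gradients are assumed to be given as flat lists, and privacy accounting and batch sampling are omitted.

```python
import math
import random
from typing import List, Sequence


def clip_by_l2_norm(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(
    params: Sequence[float],
    per_example_grads: Sequence[Sequence[float]],
    lr: float,
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """One DP-SGD update on a flat parameter vector (illustrative only)."""
    batch_size = len(per_example_grads)
    if batch_size == 0:
        raise ValueError("per_example_grads must not be empty")

    # 1. Clip every per-example gradient independently.
    clipped = [clip_by_l2_norm(g, max_norm) for g in per_example_grads]

    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(column) for column in zip(*clipped)]

    # 3. Add Gaussian noise with std = noise_multiplier * max_norm, then average.
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]

    # 4. Plain SGD step using the noisy averaged gradient.
    return [p - lr * g for p, g in zip(params, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

In this sketch the first gradient is scaled down to norm 1.0 while the second is left unchanged, so no single example can move the parameters by more than `lr * max_norm / batch_size` before noise is added. How the batch is sampled and how the privacy budget is tracked are exactly the kind of implementation details that vary across code bases and are not shown here.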