Neural machine translation (NMT) is a widely used text generation task, yet despite significant data privacy concerns for NMT systems, there is a considerable research gap in the development of privacy-preserving NMT models. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, existing work does not always spell out the implementation specifics of training a model with DP-SGD. Different software libraries are used and code bases are not always public, which leads to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
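For readers unfamiliar with DP-SGD, the following is a minimal sketch of a single update step as the algorithm is commonly described: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. It is not taken from DP-NMT. The toy linear-regression loss and all names (`clip_by_l2_norm`, `per_example_grad`, `dp_sgd_step`, `max_norm`, `noise_multiplier`) are assumptions made for illustration, and privacy accounting is omitted entirely.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def per_example_grad(w, x, y):
    """Gradient of the toy loss 0.5 * (w.x - y)^2 for a single example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def dp_sgd_step(w, batch, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per example, sum, add noise, average."""
    summed = [0.0] * len(w)
    for x, y in batch:
        clipped = clip_by_l2_norm(per_example_grad(w, x, y), max_norm)
        summed = [s + c for s, c in zip(summed, clipped)]
    # Noise standard deviation is proportional to the clipping norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [wi - lr * n / len(batch) for wi, n in zip(w, noisy)]


# Hypothetical toy batch of (features, target) pairs.
batch = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 4.0)]
rng = random.Random(0)
w = dp_sgd_step([0.0, 0.0], batch, lr=0.1, max_norm=1.0,
                noise_multiplier=1.1, rng=rng)
```

Clipping each example's gradient before summation is what bounds the influence of any single example; that bound is what lets the added noise translate into a formal guarantee. Details such as how batches are sampled and how the privacy budget is tracked vary between implementations and are not shown here.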