Applying differential privacy (DP) via the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates at the sentence level. From a DP perspective, this setup assumes that each sentence belongs to a single person and that any two sentences in the training dataset are independent. However, this assumption is violated in many real-world NMT datasets, e.g., those containing dialogues. To apply DP properly, we must therefore shift from sentences to entire documents. In this paper, we investigate NMT at both the sentence and the document level, analyze the privacy/utility trade-off in both scenarios, and evaluate the risk that choosing the wrong privacy granularity leads to leakage of personally identifiable information (PII). Our findings indicate that the document-level NMT system is more resistant to membership inference attacks, underscoring the importance of choosing the appropriate granularity when working with DP.
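The shift from sentence- to document-level DP described above boils down to changing the unit over which DP-SGD clips and noises gradients: sentence gradients from the same document are summed first, so the clipping bound (and thus the privacy guarantee) covers the whole document. The following is a minimal NumPy sketch of this idea, not the paper's actual implementation; the function name, array shapes, and parameters are illustrative assumptions.

```python
import numpy as np

def dp_sgd_step(grads, doc_ids, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One noisy gradient aggregation step with document-level DP.

    Illustrative sketch (not the paper's implementation):
      grads   -- (n_sentences, d) per-sentence gradients
      doc_ids -- length-n array mapping each sentence to a document id
    Sentences from the same document are summed before clipping, so the
    L2 bound applies to whole documents rather than individual sentences.
    """
    rng = rng or np.random.default_rng(0)
    d = grads.shape[1]
    # Sum sentence gradients within each document (the DP "unit").
    docs = np.unique(doc_ids)
    per_doc = np.stack([grads[doc_ids == doc].sum(axis=0) for doc in docs])
    # Clip each document's summed gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_doc, axis=1, keepdims=True)
    clipped = per_doc * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Add Gaussian noise calibrated to the per-document sensitivity.
    noise = rng.normal(0.0, noise_mult * clip_norm, size=d)
    return (clipped.sum(axis=0) + noise) / len(docs)
```

Running sentence-level DP-SGD on such data instead would clip each sentence gradient independently, letting one document contribute through many sentences and silently weakening the intended per-person guarantee.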