Applying differential privacy (DP) via the DP-SGD algorithm to protect individual data points during training is becoming increasingly popular in NLP. However, the granularity at which DP is applied is often neglected. For example, neural machine translation (NMT) typically operates at the sentence level. From the perspective of DP, this setup assumes that each sentence belongs to a single person and that any two sentences in the training dataset are independent. This assumption is violated in many real-world NMT datasets, however, e.g., those containing dialogues. To apply DP properly, we must therefore shift from sentences to entire documents. In this paper, we investigate NMT at both the sentence and document levels, analyzing the privacy/utility trade-off for both scenarios and evaluating the risk of leaking personally identifiable information (PII) when an inappropriate privacy granularity is used. Our findings indicate that the document-level NMT system is more resistant to membership inference attacks, emphasizing the importance of using the appropriate granularity when working with DP.
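The granularity choice discussed above can be made concrete in the clip-and-noise step of DP-SGD. Below is a minimal, illustrative sketch (not the paper's actual implementation): each privacy unit's gradient is clipped to a fixed norm before noisy aggregation, and the only difference between sentence-level and document-level DP is what counts as one unit. All function and variable names here are assumptions for illustration.

```python
# Minimal sketch of one DP-SGD noising step. The privacy unit (sentence
# vs. document) determines which gradients are clipped individually.
import numpy as np

def dp_sgd_step(per_unit_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each unit's gradient to clip_norm, sum, add Gaussian noise, average."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_unit_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose norm exceeds clip_norm.
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    # Noise scale is proportional to the clipping bound (the sensitivity).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_unit_grads)

# Sentence-level DP: each sentence's gradient is its own privacy unit.
sent_grads = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
g_sentence = dp_sgd_step(sent_grads)

# Document-level DP: sentences from the same document are summed first,
# so the whole document is clipped as a single unit. Here we pretend the
# first two sentences belong to one document.
doc_grads = [sent_grads[0] + sent_grads[1], sent_grads[2]]
g_document = dp_sgd_step(doc_grads)
```

Under sentence-level clipping, a document contributing many sentences can inject far more signal than the clipping bound suggests; grouping its sentences into one unit, as in `doc_grads`, restores the per-person sensitivity bound that the DP guarantee relies on.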