The field of text privatization often leverages the notion of \textit{Differential Privacy} (DP) to provide formal guarantees when rewriting or obfuscating sensitive textual data. A nearly ubiquitous form of DP application requires adding calibrated noise to vector representations of text, at either the data or the model level, governed by the privacy parameter $\varepsilon$. However, noise addition almost inevitably incurs considerable utility loss, highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence-infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel at utility preservation and can achieve an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protection. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP, as well as the opportunities that non-DP methods present.
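To make the noise-addition paradigm concrete, the following is a minimal sketch (not the mechanism studied in this work) of the standard Laplace mechanism applied coordinate-wise to a text embedding, assuming a hypothetical `privatize_embedding` helper with unit sensitivity; smaller $\varepsilon$ means larger noise, i.e., stronger privacy but lower utility.

```python
import numpy as np

def privatize_embedding(vec, epsilon, sensitivity=1.0, rng=None):
    """Sketch of the Laplace mechanism: add noise with scale
    sensitivity / epsilon to each coordinate of an embedding vector.
    Assumes unit L1 sensitivity; real mechanisms must bound this."""
    rng = rng if rng is not None else np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=vec.shape)
    return vec + noise

emb = np.zeros(4)  # stand-in for a sentence or word embedding
# Smaller epsilon -> larger noise scale -> stronger privacy, lower utility.
noisy_strong_privacy = privatize_embedding(emb, epsilon=0.1)
noisy_weak_privacy = privatize_embedding(emb, epsilon=10.0)
```

In DP text rewriting, such a noisy vector is typically decoded back to text (e.g., by nearest-neighbor search in embedding space or by a decoder model), which is where the utility loss discussed above becomes visible.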