$\textit{Differentially Private Text Rewriting}$ denotes a class of text privatization techniques in which (sensitive) input textual documents are $\textit{rewritten}$ under Differential Privacy (DP) guarantees. The motivation behind such methods is to hide both explicit and implicit identifiers that may be contained in text, while still retaining the semantic meaning of the original text, thus preserving utility. Recent years have seen an uptick in research output in this field, offering a diverse array of word-, sentence-, and document-level DP rewriting methods. Common to these methods is the selection of a privacy budget (i.e., the $\varepsilon$ parameter), which governs the degree to which a text is privatized. One major limitation of previous works, stemming directly from the unique structure of language itself, is the lack of consideration of $\textit{where}$ the privacy budget should be allocated, as not all aspects of language, and therefore text, are equally sensitive or personal. In this work, we are the first to address this shortcoming, asking how a given privacy budget can be intelligently and sensibly distributed across a target document. We construct and evaluate a toolkit of linguistics- and NLP-based methods for allocating a privacy budget to the constituent tokens of a text document. In a series of privacy and utility experiments, we empirically demonstrate that, given the same privacy budget, intelligent distribution leads to higher privacy levels and more favorable privacy-utility trade-offs than a naive distribution of $\varepsilon$. Our work highlights the intricacies of text privatization with DP, and it furthermore calls for future work on finding more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
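To make the contrast between naive and weighted budget distribution concrete, the following is a minimal illustrative sketch, $\textit{not}$ the paper's actual allocation toolkit: it splits a total budget $\varepsilon$ uniformly across tokens versus by a hypothetical sensitivity heuristic (here, a toy stopword list standing in for a real linguistic sensitivity signal). Under this heuristic, tokens deemed less personal receive a larger $\varepsilon$ share (less perturbation, preserving utility), while potentially identifying content words receive a smaller share (stronger privatization). All names and weights below are illustrative assumptions.

```python
# Toy stand-in for a linguistic sensitivity signal; a real system would use
# NLP-based analysis (POS tags, NER, etc.) rather than a stopword list.
STOPWORDS = {"the", "a", "an", "in", "on", "at", "is", "was", "to", "of", "and"}


def uniform_split(tokens, eps_total):
    """Naive baseline: every token receives an equal share of eps_total."""
    n = len(tokens)
    return {i: eps_total / n for i in range(n)}


def weighted_split(tokens, eps_total, content_weight=1.0, stopword_weight=3.0):
    """Hypothetical weighted allocation: stopwords are assumed to carry
    little personal information, so they get a larger epsilon share
    (less noise), while content words get a smaller share (more noise,
    hence stronger privatization where it matters)."""
    weights = [
        stopword_weight if tok.lower() in STOPWORDS else content_weight
        for tok in tokens
    ]
    total = sum(weights)
    # Per-token budgets still sum to eps_total, so sequential composition
    # over the document spends exactly the same overall budget.
    return {i: eps_total * w / total for i, w in enumerate(weights)}
```

Both schemes spend the same total budget (by sequential composition over tokens); only its distribution differs, which is precisely the degree of freedom the abstract argues should be exploited.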