Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables the sharing of sensitive textual documents while formally guaranteeing privacy protection for individuals. However, existing systems face several issues: formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, and a lack of transparency and reproducibility. In this paper, we propose a new system, 'DP-BART', that substantially outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which together drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them under different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the strict text adjacency constraint in the LDP paradigm, which leads to the high noise requirement.
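
To make the link between clipping, dimensionality, and noise concrete, the following is a minimal sketch of the generic Laplace mechanism applied to a clipped representation vector. It is not the DP-BART implementation: the function name `privatize_representation`, the parameters `c_min` and `c_max`, and the choice of per-coordinate clipping with an L1 sensitivity bound are all assumptions made for illustration.

```python
import numpy as np


def privatize_representation(z, c_min, c_max, epsilon, rng=None):
    """Clip a vector and add Laplace noise (illustrative sketch only).

    Hypothetical helper, not taken from the paper. Every coordinate is
    clipped to [c_min, c_max], so any two clipped vectors differ by at most
    n * (c_max - c_min) in L1 norm, where n is the number of coordinates.
    Laplace noise with scale sensitivity / epsilon then gives epsilon-DP
    under the standard Laplace mechanism.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if c_max <= c_min:
        raise ValueError("c_max must be greater than c_min")

    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)

    # Bound each coordinate so the sensitivity is finite.
    clipped = np.clip(z, c_min, c_max)

    # Worst-case L1 distance between any two clipped vectors.
    sensitivity = clipped.size * (c_max - c_min)

    # Laplace scale required for the chosen epsilon.
    scale = sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=clipped.shape)
    return clipped + noise, scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    full = rng.normal(size=768)   # assumed full-size representation
    pruned = full[:192]           # assumed pruned representation
    _, scale_full = privatize_representation(full, -0.1, 0.1, 10.0, rng)
    _, scale_pruned = privatize_representation(pruned, -0.1, 0.1, 10.0, rng)
    print(f"noise scale, 768 dims: {scale_full:.3f}")
    print(f"noise scale, 192 dims: {scale_pruned:.3f}")
```

Under this particular calibration the noise scale grows linearly with both the clipping range and the number of dimensions, which gives one intuition for why a tighter clipping range and fewer active dimensions could lower the noise needed for a given epsilon. The dimension counts and clipping bounds in the usage example are arbitrary placeholders.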