Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, such as formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, as well as a lack of transparency and reproducibility. In this paper, we propose a new system, DP-BART, that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which together drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them under different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm that leads to the high noise requirement.
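To make the link between clipping and noise concrete, the following minimal sketch shows the generic Laplace mechanism applied to a clipped representation vector. It is an illustration under stated assumptions, not the DP-BART implementation: the function names, the clipping bounds `c_min` and `c_max`, and the toy vector are all hypothetical, and the novel clipping method, pruning, and further training described above are not reproduced here.

```python
import random
from typing import List, Optional


def clip_vector(values: List[float], c_min: float, c_max: float) -> List[float]:
    """Clip every coordinate into [c_min, c_max] so the sensitivity is bounded."""
    return [min(max(v, c_min), c_max) for v in values]


def l1_sensitivity(dim: int, c_min: float, c_max: float) -> float:
    """Worst-case L1 distance between any two clipped vectors of length dim."""
    return dim * (c_max - c_min)


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(
    values: List[float],
    epsilon: float,
    c_min: float = -0.1,  # hypothetical bound, chosen only for illustration
    c_max: float = 0.1,   # hypothetical bound, chosen only for illustration
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Generic Laplace mechanism: clip, then add noise of scale sensitivity/epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if c_min >= c_max:
        raise ValueError("c_min must be smaller than c_max")
    rng = rng or random.Random()
    clipped = clip_vector(values, c_min, c_max)
    scale = l1_sensitivity(len(clipped), c_min, c_max) / epsilon
    return [v + sample_laplace(scale, rng) for v in clipped]


if __name__ == "__main__":
    representation = [0.8, -0.03, 0.05, -1.2]  # toy stand-in for an encoder output
    print(privatize(representation, epsilon=10.0, rng=random.Random(0)))
```

In this generic setup the noise scale grows linearly with both the width of the clipping interval and the number of dimensions, so a tighter interval or a shorter vector lowers the noise needed for a given epsilon.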