A widely used method for ensuring the privacy of unstructured text data is the multidimensional Laplace mechanism for $d_X$-privacy, a relaxation of differential privacy to metric spaces. We identify an intriguing peculiarity of this mechanism. When applied on a word-by-word basis, the mechanism either outputs the original word or completely dissimilar words, and only very rarely any semantically similar words. We investigate this observation in detail and tie it to the fact that in any word embedding model (which is high-dimensional), the distance from a word to its nearest neighbor is much larger than the difference between its distances to any two consecutive neighbors. We also show that the dot product of the multidimensional Laplace noise vector with any word embedding plays a crucial role in determining the nearest neighbor. We derive the distribution, moments, and tail bounds of this dot product. We further propose a fix as a post-processing step, which satisfactorily removes the above-mentioned issue.
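The word-by-word mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function names (`multivariate_laplace_noise`, `privatize_word`) and the toy vocabulary are hypothetical, and the noise is sampled in the standard way for the multidimensional Laplace distribution (a uniform direction on the unit sphere scaled by a Gamma-distributed magnitude), with the perturbed vector then mapped back to its nearest vocabulary word.

```python
import numpy as np

def multivariate_laplace_noise(d, epsilon, rng):
    """Sample noise from the d-dimensional Laplace distribution
    with density proportional to exp(-epsilon * ||z||)."""
    # Direction: uniform on the unit sphere.
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    # Magnitude: Gamma(shape=d, scale=1/epsilon).
    r = rng.gamma(shape=d, scale=1.0 / epsilon)
    return r * v

def privatize_word(word, embeddings, vocab, epsilon, rng):
    """Perturb a word's embedding and release the nearest vocabulary word."""
    z = embeddings[vocab[word]]
    z = z + multivariate_laplace_noise(embeddings.shape[1], epsilon, rng)
    # Nearest neighbor (Euclidean) of the noisy vector is the output word.
    idx = int(np.argmin(np.linalg.norm(embeddings - z, axis=1)))
    return list(vocab)[idx]

# Toy example: three words with one-hot "embeddings".
rng = np.random.default_rng(0)
vocab = {"apple": 0, "banana": 1, "cherry": 2}
embeddings = np.eye(3)
out = privatize_word("apple", embeddings, vocab, epsilon=1.0, rng=rng)
```

For large `epsilon` the noise magnitude concentrates near zero and the original word is returned; for small `epsilon` the noisy vector lands far away and the nearest neighbor is typically an unrelated word, which is exactly the all-or-nothing behavior discussed above.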