Differentially Private Substring and Document Counting

Differential privacy is the gold standard for privacy in data analysis. In many data analysis applications, the data is a database of documents. For databases consisting of many documents, one of the most fundamental problems is pattern matching, in particular computing (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). In this paper, we initiate the theoretical study of substring and document counting under differential privacy. We give an $\epsilon$-differentially private data structure solving this problem for all patterns simultaneously with a maximum additive error of $O(\ell \cdot\mathrm{polylog}(n\ell|\Sigma|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|\Sigma|$ is the size of the alphabet. We show that this is optimal up to an $O(\mathrm{polylog}(n\ell))$ factor. Further, we show that for $(\epsilon,\delta)$-differential privacy, the bound for document counting can be improved to $O(\sqrt{\ell} \cdot\mathrm{polylog}(n\ell|\Sigma|))$. Additionally, our data structures are efficient: they use $O(n\ell^2)$ space, $O(n^2\ell^4)$ preprocessing time, and $O(|P|)$ query time, where $P$ is the query pattern. Along the way, we develop a new technique, which is of independent interest, for differentially privately computing a general class of counting functions on trees. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. For $q$-grams, we further improve the preprocessing time of the data structure.
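To make the two counting problems concrete, the following minimal Python sketch computes both counts exactly and then privatizes a single query with the standard Laplace mechanism. It is an illustration only, not the data structure described above: it answers one pattern at a time, whereas the paper's structure handles all patterns simultaneously. The function names, the use of Python's `random` module, and the neighboring relation (two databases differ by replacing one document of length at most $\ell$) are assumptions made for this example. Under that relation, a single document count changes by at most 1 and a single substring count by at most $\ell$.

```python
import random


def substring_count(docs, pattern):
    """Total number of (possibly overlapping) occurrences of pattern in all documents."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    total = 0
    for doc in docs:
        start = doc.find(pattern)
        while start != -1:
            total += 1
            # Advance by one position so overlapping occurrences are counted.
            start = doc.find(pattern, start + 1)
    return total


def document_count(docs, pattern):
    """Number of documents that contain pattern as a substring."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    return sum(1 for doc in docs if pattern in doc)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, sensitivity, epsilon, rng):
    """Laplace mechanism for ONE query (illustrative baseline, not the paper's method)."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    docs = ["abab", "bab", "ccc"]      # toy database, n = 3
    max_len = max(len(d) for d in docs)  # plays the role of ell
    rng = random.Random(0)

    print(substring_count(docs, "ab"))  # 3
    print(document_count(docs, "ab"))   # 2

    # Assumed sensitivities: at most ell for substring counting, 1 for document counting.
    print(noisy_count(substring_count(docs, "ab"), max_len, epsilon=1.0, rng=rng))
    print(noisy_count(document_count(docs, "ab"), 1, epsilon=1.0, rng=rng))
```

Answering many patterns this way would require splitting the privacy budget across queries, which is the kind of cost a data structure for all patterns simultaneously is meant to avoid.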
