For databases consisting of many text documents, one of the most fundamental data analysis tasks is counting (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). If such a database contains sensitive data, it is crucial to protect the privacy of individuals in the database. Differential privacy is the gold standard for privacy in data analysis. It gives rigorous privacy guarantees, but comes at the cost of yielding less accurate results. In this paper, we carry out a theoretical study of substring and document counting under differential privacy. We propose a data structure storing $ε$-differentially private counts for all possible query patterns with a maximum additive error of $O(\ell\cdot\mathrm{polylog}(n\ell|Σ|))$, where $\ell$ is the maximum length of a document in the database, $n$ is the number of documents, and $|Σ|$ is the size of the alphabet. We also improve the error bound for document counting with $(ε, δ)$-differential privacy to $O(\sqrt{\ell}\cdot\mathrm{polylog}(n\ell|Σ|))$. We show that our additive errors for substring counting and document counting are optimal up to an $O(\mathrm{polylog}(n\ell))$ factor for both $ε$-differential privacy and $(ε, δ)$-differential privacy. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and $q$-grams. Additionally, we develop a new technique, of independent interest, for computing a general class of counting functions on trees under differential privacy.
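To make the two counting tasks concrete, the following is a minimal sketch of substring counting and document counting for a single pattern, together with a naive Laplace-mechanism baseline. It is not the data structure proposed in the paper, which answers all patterns at once. The function names, the even split of the privacy budget, and the neighbouring relation (two databases differ in one document) are assumptions made for illustration only.

```python
import random


def substring_count(docs, pattern):
    """Total number of (possibly overlapping) occurrences of a non-empty
    pattern across all documents."""
    m = len(pattern)
    return sum(
        1
        for doc in docs
        for i in range(len(doc) - m + 1)
        if doc.startswith(pattern, i)
    )


def document_count(docs, pattern):
    """Number of documents that contain the pattern as a substring."""
    return sum(1 for doc in docs if pattern in doc)


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two independent
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(docs, pattern, epsilon, max_len):
    """Naive baseline for ONE pattern (hypothetical helper, not the paper's method).

    Assumption: neighbouring databases differ in one document of length
    at most max_len. Replacing a document then changes the substring count
    by at most max_len and the document count by at most 1. Each count is
    released with half of the privacy budget epsilon.
    """
    half = epsilon / 2.0
    noisy_sub = substring_count(docs, pattern) + laplace_noise(max_len / half)
    noisy_doc = document_count(docs, pattern) + laplace_noise(1.0 / half)
    return noisy_sub, noisy_doc


docs = ["abab", "ba", "aaa"]
exact = (substring_count(docs, "a"), document_count(docs, "a"))
private = noisy_counts(docs, "a", epsilon=1.0, max_len=4)
```

In this baseline the noise added to a single substring count already scales with $\ell/ε$, and the privacy budget would have to be split further to answer many patterns. That is the setting in which the abstract's error bounds over all possible query patterns should be read.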