This paper establishes strict optimality in precision for frequency and distribution estimation under local differential privacy (LDP). We prove that a linear estimator with a symmetric, extremal configuration and a constant support size set to an optimized value suffices to achieve the theoretical lower bound on the $\mathcal{L}_2$ loss for both frequency and distribution estimation. The theoretical $\mathcal{L}_1$ lower bound is also achieved asymptotically. Furthermore, we show that the communication cost of such an optimal estimator can be as low as $\log_2(\frac{d(d-1)}{2}+1)$ bits, where $d$ denotes the dictionary size, and we propose an algorithm to generate this optimal estimator. In addition, we introduce a modified Count-Mean Sketch and demonstrate that it is practically indistinguishable from the theoretical optimum when the dictionary is sufficiently large (e.g., $d=100$ for a privacy parameter of $\varepsilon = 1$). We compare existing methods with our proposed optimal estimator to provide selection guidelines for practical deployment. Finally, we evaluate these estimators experimentally and find that the empirical results are consistent with our theoretical derivations.
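To give a feel for the communication cost quoted above, the following minimal sketch evaluates $\log_2(\frac{d(d-1)}{2}+1)$ for a few dictionary sizes. The function name `optimal_comm_bits` and the chosen values of `d` are illustrative assumptions, and rounding up to a whole number of bits is our own convention rather than a claim made in the abstract.

```python
import math


def optimal_comm_bits(d: int) -> float:
    """Evaluate log2(d(d-1)/2 + 1), the cost expression quoted in the abstract.

    `d` is the dictionary size. The function name is hypothetical.
    """
    if d < 2:
        raise ValueError("dictionary size d must be at least 2")
    # d(d-1) is always even, so integer division is exact here.
    return math.log2(d * (d - 1) // 2 + 1)


if __name__ == "__main__":
    # Illustrative dictionary sizes only; not taken from the paper's experiments.
    for d in (2, 10, 100, 1000):
        bits = optimal_comm_bits(d)
        # Rounding up to whole bits is an assumption made for this sketch.
        print(f"d={d}: {bits:.2f} bits (at most {math.ceil(bits)} whole bits)")
```

The expression grows roughly as $2\log_2 d$, so the cost stays small even for large dictionaries.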