Adaptive Online Bayesian Estimation of Frequency Distributions with Local Differential Privacy

We propose a novel Bayesian approach for the adaptive, online estimation of the frequency distribution over a finite number of categories under the local differential privacy (LDP) framework. The proposed algorithm performs Bayesian parameter estimation via posterior sampling and adapts the LDP randomization mechanism based on the obtained posterior samples. We propose a randomized LDP mechanism that takes a subset of categories as an input and whose performance depends on the selected subset and the true frequency distribution. Using the posterior sample as an estimate of the frequency distribution, the algorithm performs a computationally tractable subset selection step to maximize the utility of the next user's privatized response. We propose several utility functions related to well-known information metrics, including (but not limited to) the Fisher information matrix, total variation distance, and information entropy, and compare them in terms of computational complexity. For posterior sampling, we employ stochastic gradient Langevin dynamics, a computationally efficient approximate Markov chain Monte Carlo method. We provide a theoretical analysis showing that (i) the posterior distribution targeted by the algorithm converges to the true parameter even when posterior sampling is approximate, and (ii) the algorithm selects the optimal subset with high probability if posterior sampling is performed exactly. We also provide numerical results that empirically demonstrate the estimation accuracy of our algorithm, comparing it with nonadaptive and semi-adaptive approaches across experimental settings with various combinations of privacy parameters and population distribution parameters.
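
To make the posterior-sampling step concrete, the following is a minimal Python sketch of stochastic gradient Langevin dynamics applied to privatized categorical reports. It is not the subset-based mechanism proposed here, whose details the abstract does not give: as a stand-in it uses standard k-ary randomized response, and it omits the adaptive subset selection step entirely. The softmax parameterization, the Gaussian prior on the unconstrained parameters, the step size, the batch size, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def krr_privatize(x, K, eps, rng):
    """k-ary randomized response (stand-in mechanism, an assumption):
    keep the true category w.p. e^eps / (e^eps + K - 1), else report
    one of the other K - 1 categories uniformly at random."""
    p_keep = np.exp(eps) / (np.exp(eps) + K - 1)
    keep = rng.random(x.shape[0]) < p_keep
    shifted = (x + rng.integers(1, K, size=x.shape[0])) % K  # never equals x
    return np.where(keep, x, shifted)

def krr_matrix(K, eps):
    """Transition matrix G with G[i, j] = P(report j | true category i)."""
    q = 1.0 / (np.exp(eps) + K - 1)
    G = np.full((K, K), q)
    np.fill_diagonal(G, np.exp(eps) * q)
    return G

def softmax(phi):
    z = np.exp(phi - phi.max())
    return z / z.sum()

def sgld_posterior_mean(y, K, eps, rng, step=1e-4, n_iter=3000,
                        burn_in=1000, batch=100, prior_var=1.0):
    """SGLD on phi, where theta = softmax(phi) and phi ~ N(0, prior_var * I).
    Returns the average of the post-burn-in theta samples."""
    G = krr_matrix(K, eps)
    N = y.shape[0]
    phi = np.zeros(K)
    samples = []
    for t in range(n_iter):
        theta = softmax(phi)
        h = G.T @ theta  # distribution of a privatized report under theta
        # Minibatch of reports (sampled with replacement), summarized as counts.
        counts = np.bincount(y[rng.integers(0, N, size=batch)], minlength=K)
        # Gradient of sum_y counts[y] * log h[y] with respect to phi.
        grad_lik = theta * (G @ (counts / h) - batch)
        grad = -phi / prior_var + (N / batch) * grad_lik
        # Langevin step: half-step along the gradient plus Gaussian noise.
        phi = phi + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(K)
        if t >= burn_in:
            samples.append(softmax(phi))
    return np.mean(samples, axis=0)

# Hypothetical usage: 20,000 simulated users, 4 categories, eps = 2.
rng = np.random.default_rng(0)
K, eps = 4, 2.0
theta_true = np.array([0.5, 0.25, 0.15, 0.1])
x = rng.choice(K, size=20000, p=theta_true)
y = krr_privatize(x, K, eps, rng)
theta_hat = sgld_posterior_mean(y, K, eps, rng)
print(theta_hat)
```

In an adaptive scheme of the kind described above, a posterior sample drawn this way would be used to configure the mechanism for the next user; the sketch keeps the mechanism fixed for simplicity.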
