Bayesian inference provides a principled framework for learning from complex data and reasoning under uncertainty. It has been widely applied in machine learning tasks such as medical diagnosis, drug design, and policymaking, where the data can be highly sensitive. Differential privacy (DP) offers data analysis tools with strong worst-case privacy guarantees and has become the leading approach to privacy-preserving data analysis. In this paper, we study Metropolis-Hastings (MH), one of the most fundamental Markov chain Monte Carlo (MCMC) methods, for large-scale Bayesian inference under differential privacy. While most existing private MCMC algorithms sacrifice accuracy and efficiency to obtain privacy, we provide the first exact and fast DP MH algorithm, which uses only a minibatch of data in most iterations. We further reveal, for the first time, a three-way trade-off among privacy, scalability (i.e., the batch size), and efficiency (i.e., the convergence rate), theoretically characterizing how privacy affects the utility and computational cost of Bayesian inference. We empirically demonstrate the effectiveness and efficiency of our algorithm in various experiments.
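For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of a standard, non-private, full-batch random-walk Metropolis-Hastings sampler. It is not the DP MH algorithm proposed in the paper. The model (a Gaussian mean with a Gaussian prior), the function names `log_posterior` and `metropolis_hastings`, and all parameter values are assumptions chosen purely for illustration.

```python
import math
import random


def log_posterior(theta, data, prior_std=10.0, noise_std=1.0):
    """Unnormalized log posterior of a Gaussian mean (illustrative model)."""
    log_prior = -0.5 * (theta / prior_std) ** 2
    # Full-batch log-likelihood: touches every data point on each call.
    log_lik = sum(-0.5 * ((x - theta) / noise_std) ** 2 for x in data)
    return log_prior + log_lik


def metropolis_hastings(data, n_steps=5000, step_size=0.5, theta0=0.0, seed=0):
    """Random-walk MH with a symmetric Gaussian proposal (no privacy noise)."""
    rng = random.Random(seed)
    theta = theta0
    current_lp = log_posterior(theta, data)
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal, data)
        # Symmetric proposal, so the acceptance probability is
        # min(1, posterior ratio). 1 - random() lies in (0, 1], avoiding log(0).
        log_u = math.log(1.0 - rng.random())
        if log_u < proposal_lp - current_lp:
            theta, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_steps


# Hypothetical synthetic data centered near 2.0.
data = [2.0 + 0.1 * ((i % 7) - 3) for i in range(50)]
samples, acceptance_rate = metropolis_hastings(data)
burn_in = 500
posterior_mean = sum(samples[burn_in:]) / len(samples[burn_in:])
print(f"posterior mean estimate: {posterior_mean:.3f}")
print(f"acceptance rate: {acceptance_rate:.2f}")
```

Each iteration of this baseline evaluates the likelihood over the entire dataset in the accept/reject step. That per-iteration cost is what a minibatch variant, such as the one described in the abstract, aims to reduce.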