Bayesian inference provides a principled framework for learning from complex data and reasoning under uncertainty. It has been widely applied in machine learning tasks such as medical diagnosis, drug design, and policymaking. In such applications, the data can be highly sensitive. Differential privacy (DP) offers data analysis tools with strong worst-case privacy guarantees and has become the leading approach to privacy-preserving data analysis. In this paper, we study Metropolis-Hastings (MH), one of the most fundamental Markov chain Monte Carlo (MCMC) methods, for large-scale Bayesian inference under differential privacy. While most existing private MCMC algorithms sacrifice accuracy and efficiency to obtain privacy, we provide the first exact and fast DP MH algorithm, which uses only a minibatch of data in most iterations. We further reveal, for the first time, a three-way trade-off among privacy, scalability (i.e., the batch size), and efficiency (i.e., the convergence rate), theoretically characterizing how privacy affects utility and computational cost in Bayesian inference. We empirically demonstrate the effectiveness and efficiency of our algorithm in various experiments.
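For readers unfamiliar with the baseline, the following is a minimal sketch of the standard, non-private, full-data Metropolis-Hastings sampler that the abstract builds on. It is not the DP MH algorithm of the paper: it adds no privacy noise and evaluates the likelihood on the whole dataset at every step. The toy model (a Gaussian mean with a Gaussian prior) and all names (`log_posterior`, `metropolis_hastings`, `step_size`, `prior_std`, `noise_std`) are assumptions chosen for illustration only.

```python
import numpy as np


def log_posterior(theta, data, prior_std=10.0, noise_std=1.0):
    """Unnormalized log posterior of a Gaussian mean (illustrative toy model)."""
    log_prior = -0.5 * (theta / prior_std) ** 2
    log_lik = -0.5 * np.sum((data - theta) ** 2) / noise_std ** 2
    return log_prior + log_lik


def metropolis_hastings(data, n_steps=5000, step_size=0.2, theta0=0.0, seed=0):
    """Random-walk MH over a scalar parameter using the full dataset each step."""
    rng = np.random.default_rng(seed)
    theta = theta0
    current_lp = log_posterior(theta, data)
    samples = np.empty(n_steps)
    n_accept = 0
    for t in range(n_steps):
        # Symmetric Gaussian proposal, so the proposal densities cancel
        # in the acceptance ratio.
        proposal = theta + step_size * rng.normal()
        proposal_lp = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio), done in log space.
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            theta, current_lp = proposal, proposal_lp
            n_accept += 1
        samples[t] = theta
    return samples, n_accept / n_steps
```

Calling `metropolis_hastings(data)` on a one-dimensional NumPy array returns the chain and its acceptance rate. Each iteration touches every data point in `log_posterior`, which is the per-step cost that the minibatch approach described in the abstract aims to avoid in most iterations.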