How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy

ML models are ubiquitous in real-world applications and are a constant focus of research. At the same time, the community has started to realize the importance of protecting the privacy of ML training data. Differential Privacy (DP) has become a gold standard for making formal statements about data anonymization. However, while some adoption of DP has happened in industry, attempts to apply DP to complex real-world ML models are still few and far between. The adoption of DP is hindered by limited practical guidance on what DP protection entails and what privacy guarantees to aim for, and by the difficulty of achieving good privacy-utility-computation trade-offs for ML models. Tricks for tuning and maximizing performance are scattered among papers or stored in the heads of practitioners. Furthermore, the literature seems to present conflicting evidence on how and whether to apply architectural adjustments and which components are "safe" to use with DP. This work is a self-contained guide that gives an in-depth overview of the field of DP ML and presents information about achieving the best possible DP ML model with rigorous privacy guarantees. Our target audience is both researchers and practitioners. Researchers interested in DP for ML will benefit from a clear overview of current advances and areas for improvement. We include theory-focused sections that highlight important topics such as convergence and privacy accounting and its assumptions. For practitioners, we provide a background in DP theory and a clear step-by-step guide for choosing an appropriate privacy definition and approach, implementing DP training, potentially updating the model architecture, and tuning hyperparameters. For both researchers and practitioners, consistently and fully reporting privacy guarantees is critical, and so we propose a set of specific best practices for stating guarantees.
