How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy

ML models are ubiquitous in real-world applications and are a constant focus of research. At the same time, the community has started to realize the importance of protecting the privacy of ML training data. Differential Privacy (DP) has become a gold standard for making formal statements about data anonymization. However, while DP has seen some adoption in industry, attempts to apply it to complex real-world ML models are still few and far between. Adoption is hindered by limited practical guidance on what DP protection entails and what privacy guarantees to aim for, and by the difficulty of achieving good privacy-utility-computation trade-offs for ML models. Tricks for tuning and maximizing performance are scattered among papers or stored in the heads of practitioners. Furthermore, the literature seems to present conflicting evidence on how and whether to apply architectural adjustments and on which components are "safe" to use with DP. This work is a self-contained guide that gives an in-depth overview of the field of DP ML and presents information about achieving the best possible DP ML model with rigorous privacy guarantees. Our target audience is both researchers and practitioners. Researchers interested in DP for ML will benefit from a clear overview of current advances and areas for improvement. We include theory-focused sections that highlight important topics such as privacy accounting and its assumptions, and convergence. For practitioners, we provide a background in DP theory and a clear step-by-step guide for choosing an appropriate privacy definition and approach, implementing DP training, potentially updating the model architecture, and tuning hyperparameters. For both researchers and practitioners, consistently and fully reporting privacy guarantees is critical, and so we propose a set of specific best practices for stating guarantees.
