Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD attains a privacy-utility trade-off of $\mathcal{O}(1/\varepsilon^2)$ at a convergence speed independent of $\varepsilon$, while DP-SignSGD attains an $\mathcal{O}(1/\varepsilon)$ trade-off at a speed linear in $\varepsilon$, dominating in high-privacy or large-batch-noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory on both training and test metrics, and extend our findings from DP-SignSGD to DP-Adam.
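To make the contrast concrete, the following is a minimal sketch (not the paper's implementation) of one step of DP-SGD and DP-SignSGD with per-example clipping. The helper names, the fixed noise multiplier `sigma`, and the toy quadratic example are illustrative assumptions; in practice `sigma` is calibrated to the target $\varepsilon$ with a privacy accountant.

```python
# Minimal sketch contrasting one step of DP-SGD and DP-SignSGD under
# per-example clipping with Gaussian noise. The noise multiplier `sigma`
# and its relation to epsilon are placeholders; a real implementation
# calibrates sigma to the privacy budget via a privacy accountant.
import numpy as np

def clip_per_example(grads, clip_norm):
    """Clip each per-example gradient to L2 norm at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def private_gradient(grads, clip_norm, sigma, rng):
    """Average the clipped per-example gradients and add isotropic Gaussian noise."""
    clipped = clip_per_example(grads, clip_norm)
    noise = rng.normal(0.0, sigma * clip_norm, size=clipped.shape[1])
    return clipped.mean(axis=0) + noise / clipped.shape[0]

def dp_sgd_step(params, grads, lr, clip_norm, sigma, rng):
    # DP-SGD: descend along the noisy clipped-mean gradient.
    return params - lr * private_gradient(grads, clip_norm, sigma, rng)

def dp_signsgd_step(params, grads, lr, clip_norm, sigma, rng):
    # DP-SignSGD: descend along the coordinate-wise sign of the same noisy gradient.
    return params - lr * np.sign(private_gradient(grads, clip_norm, sigma, rng))

# Toy usage on f(w) = 0.5 * ||w||^2, whose per-example gradient is simply w.
rng = np.random.default_rng(0)
w = np.ones(4)
per_example_grads = np.tile(w, (8, 1))  # batch of 8 identical examples
w_sgd = dp_sgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, sigma=1.0, rng=rng)
w_sign = dp_signsgd_step(w, per_example_grads, lr=0.1, clip_norm=1.0, sigma=1.0, rng=rng)
```

The sketch highlights the point of the fixed-hyperparameter comparison: both methods share the same clipped, noised gradient, but DP-SignSGD's sign step bounds the per-coordinate update magnitude by `lr`, so its sensitivity to the noise scale (and hence to $\varepsilon$) enters only through the sign's accuracy rather than the update size.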