Differentially private stochastic gradient descent (DP-SGD) is known to have poorer training and test performance on large neural networks than ordinary stochastic gradient descent (SGD). In this paper, we perform a detailed study and comparison of the two processes and unveil several new insights. By comparing the behavior of the two processes separately in early and late epochs, we find that while DP-SGD makes slower progress in the early stages, it is the behavior in the later stages that determines the end result. A separate analysis of the clipping and noise addition steps of DP-SGD shows that while noise introduces errors to the process, gradient descent can recover from these errors when it is not clipped, and clipping appears to have a larger impact than noise. These effects are amplified in higher dimensions (large neural networks), where the loss basin occupies a lower-dimensional space. We argue, both theoretically and through extensive experiments, that magnitude pruning can be a suitable dimension reduction technique in this regard, and find that heavy pruning can improve the test accuracy of DP-SGD.
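For readers unfamiliar with the two DP-SGD steps the abstract separates (per-example clipping and noise addition) and with magnitude pruning, the following is a minimal pure-Python sketch. It is not the paper's implementation: the function names `clip_gradient`, `dp_sgd_step`, and `magnitude_prune_mask` are hypothetical, the noise scale `noise_multiplier * clip_norm` follows the commonly used DP-SGD formulation, and how pruning is actually combined with training in the paper is not stated in the abstract.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Rescale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average.

    Assumption: Gaussian noise with standard deviation
    noise_multiplier * clip_norm is added to each coordinate of the summed
    clipped gradients before dividing by the batch size.
    """
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    batch_size = len(clipped)
    sigma = noise_multiplier * clip_norm
    new_params = []
    for i, p in enumerate(params):
        total = sum(g[i] for g in clipped)
        noisy_total = total + rng.gauss(0.0, sigma)
        new_params.append(p - lr * noisy_total / batch_size)
    return new_params


def magnitude_prune_mask(params, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights.

    sparsity is the fraction of weights to remove (hypothetical parameter name).
    """
    n_prune = int(len(params) * sparsity)
    order = sorted(range(len(params)), key=lambda i: abs(params[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(params))]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [1.0, -0.02, 0.5, 0.001]
    grads = [[3.0, 4.0, 0.0, 0.0], [0.1, 0.0, -0.2, 0.3]]
    mask = magnitude_prune_mask(params, sparsity=0.5)
    updated = dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                          noise_multiplier=1.0, rng=rng)
    # Apply the mask so pruned coordinates stay at zero after the update.
    print([w * m for w, m in zip(updated, mask)])
```

Setting `noise_multiplier=0.0` isolates the effect of clipping alone, and passing a very large `clip_norm` isolates the effect of noise alone, which mirrors the kind of separate analysis the abstract describes.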