Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields better test performance than training with larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNN training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees require massive batches. We first show that the phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity, and not the clipping, is the cause of this implicit bias, even with additional isotropic Gaussian noise. We theoretically analyse the solutions obtained with continuous versions of Noisy-SGD in the Linear Least Squares and Diagonal Linear Network settings, and show that the additional noise in fact amplifies the implicit bias. Thus, the performance issues of large-batch DP-SGD training are rooted in the same underlying principles as those of SGD, offering hope for improved large-batch training strategies.
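To make the distinction between DP-SGD and Noisy-SGD concrete, the following is a minimal NumPy sketch of a single update on a linear least-squares loss. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the per-example loss 0.5 * (x @ w - y)**2, and the `noise_std` parameter are all hypothetical. In DP-SGD the noise scale is typically tied to the clipping norm; here it is passed in directly for simplicity. Leaving `clip_norm` as `None` gives the Noisy-SGD variant (no clipping).

```python
import numpy as np

def per_example_grads(w, X, y):
    """Gradient of 0.5 * (x @ w - y)**2 for each example; shape (batch, dim)."""
    residuals = X @ w - y
    return residuals[:, None] * X

def clip_rows(G, clip_norm):
    """Rescale each row of G so that its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(G, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return G * scale

def noisy_sgd_step(w, X, y, lr, noise_std, rng, clip_norm=None):
    """One update on the mini-batch (X, y).

    clip_norm=None -> Noisy-SGD (isotropic Gaussian noise, no clipping).
    clip_norm=C    -> DP-SGD-style step (per-example clipping, then noise).
    noise_std is an illustrative parameter, not a calibrated privacy setting.
    """
    G = per_example_grads(w, X, y)
    if clip_norm is not None:
        G = clip_rows(G, clip_norm)
    # Isotropic Gaussian noise is added to the summed gradient.
    noise = rng.normal(0.0, noise_std, size=w.shape)
    g = (G.sum(axis=0) + noise) / X.shape[0]
    return w - lr * g
```

The mini-batch passed as `(X, y)` is the source of the sampling stochasticity discussed above; the Gaussian term is the additional isotropic noise. Setting `noise_std=0.0` with `clip_norm=None` recovers a plain mini-batch SGD step.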