Recent advances have substantially improved the accuracy, memory cost, and training speed of differentially private (DP) deep learning, especially on large vision and language models with millions to billions of parameters. In this work, we thoroughly study the per-sample gradient clipping style, a key component in DP optimization. We show that different clipping styles have the same time complexity but instantiate an accuracy-memory trade-off: all-layer clipping (of coarse granularity) is the most prevalent and usually gives the best accuracy, but it incurs a heavier memory cost than other group-wise clipping styles, such as layer-wise clipping (of finer granularity). We formalize this trade-off through our convergence theory and complexity analysis. Importantly, we demonstrate that the accuracy gap between group-wise clipping and all-layer clipping becomes smaller for larger models, while the memory advantage of group-wise clipping remains. Consequently, group-wise clipping allows DP optimization of large models to achieve high accuracy and low peak memory simultaneously.
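To make the notion of clipping granularity concrete, the following is a minimal sketch in plain Python, not the implementation studied in this work. It assumes the standard clipping rule that rescales a per-sample gradient by min(1, bound / norm), and it assumes that each of the M groups receives the bound R / sqrt(M) so that the whole per-sample gradient still has norm at most R. The function names `clip_per_sample` and `total_norm`, the layer names, and the toy numbers are all hypothetical. Noise addition is omitted.

```python
import math


def clip_per_sample(grads, groups, max_norm):
    """Clip per-sample gradients group by group.

    grads:    {layer_name: [flat gradient (list of floats) for each sample]}
    groups:   list of lists of layer names; every layer appears in exactly one group
    max_norm: overall clipping threshold R (assumption: split as R / sqrt(M))
    """
    n_samples = len(next(iter(grads.values())))
    group_bound = max_norm / math.sqrt(len(groups))
    clipped = {name: [] for name in grads}
    for i in range(n_samples):
        for group in groups:
            # Norm of sample i's gradient restricted to the layers in this group.
            sq = sum(g * g for name in group for g in grads[name][i])
            norm = math.sqrt(sq)
            scale = min(1.0, group_bound / norm) if norm > 0 else 1.0
            for name in group:
                clipped[name].append([g * scale for g in grads[name][i]])
    return clipped


def total_norm(grads, i):
    """Norm of sample i's gradient across all layers."""
    return math.sqrt(sum(g * g for layer in grads.values() for g in layer[i]))


# Toy per-sample gradients for two hypothetical layers and two samples.
grads = {
    "fc1": [[3.0, 4.0], [0.3, 0.4]],
    "fc2": [[12.0], [0.0]],
}

all_layer = clip_per_sample(grads, [["fc1", "fc2"]], max_norm=1.0)    # one coarse group
layer_wise = clip_per_sample(grads, [["fc1"], ["fc2"]], max_norm=1.0)  # one group per layer

for i in range(2):
    print(i, total_norm(all_layer, i), total_norm(layer_wise, i))
```

In this sketch, all-layer clipping needs the gradient norm over every layer before any layer can be rescaled, whereas layer-wise clipping only needs the norm of the layer at hand. The sketch does not model memory or accuracy; it only shows that both styles bound the per-sample gradient norm by the same R while rescaling the layers differently.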