Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and the batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires a smaller $C$) and empirical outcomes (a larger $C$ performs better under strong privacy), a mismatch caused by changes in the gradient distributions. Assuming a limited compute budget (a fixed number of epochs), we demonstrate that the existing heuristics for tuning $B$ fail, whereas cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of reusing a single $(C,B)$ setting across tasks can lead to suboptimal performance. Performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and by examining cumulative DP noise.
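To make the roles of $C$ and $B$ concrete, the sketch below shows one step of standard DP-SGD, the general recipe the abstract refers to; the function and variable names here are illustrative, not the paper's code. Clipping rescales each per-example gradient by $\min(1, C/\lVert g_i \rVert)$, which is exactly the re-weighting view mentioned above, and Gaussian noise with per-coordinate standard deviation $\sigma C$ is added before averaging over the batch.

```python
import numpy as np

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """One illustrative DP-SGD step (hypothetical names, not the paper's code).

    per_example_grads: (B, d) array, one gradient row per example.
    C: clipping bound; sigma: noise multiplier.
    """
    B, d = per_example_grads.shape
    # Clipping as re-weighting: each example's gradient is scaled by
    # min(1, C / ||g_i||), down-weighting examples with large gradients.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    # Gaussian noise calibrated to the clipping bound: per-coordinate std sigma*C.
    noise = rng.normal(loc=0.0, scale=sigma * C, size=d)
    # After averaging over B, the noise in the update has std sigma*C/B: a larger
    # batch lowers per-step noise but, at fixed epochs, also means fewer steps.
    return (clipped.sum(axis=0) + noise) / B

# Toy usage: B=32 examples, d=10 parameters.
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))
update = dp_sgd_step(grads, C=1.0, sigma=1.0, rng=rng)
```

Under this update, a fixed budget of $E$ epochs over $N$ examples gives $T = EN/B$ steps with independent per-step noise of standard deviation $\sigma C / B$, so the accumulated noise grows roughly as $\sigma C \sqrt{EN} / B^{3/2}$. This back-of-envelope holds $\sigma$ fixed for simplicity (in practice $\sigma$ also depends on the sampling rate $B/N$ through the privacy accountant), but it illustrates the kind of cumulative-DP-noise argument the abstract appeals to when comparing small and large batches.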