Public pretraining is a promising approach to improving differentially private model training. However, recent work has noted that many positive results in this paradigm only consider in-distribution tasks, and may not apply to settings where there is distribution shift between the pretraining and finetuning data -- a scenario that is likely when finetuning on private tasks due to the sensitive nature of the data. In this work, we show empirically across three tasks that even under large distribution shift, where both zero-shot performance from public data and training from scratch on private data give unusably weak results, public features can in fact improve private training accuracy by up to 67\% over private training from scratch. We provide a theoretical explanation for this phenomenon, showing that if the public and private data share a low-dimensional representation, public representations can improve the sample complexity of private training even when it is impossible to learn the private task from the public data alone. Altogether, our results provide evidence that public data can indeed make private training practical in realistic settings of extreme distribution shift.
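To make the empirical recipe concrete: the setting described above amounts to extracting features from a publicly pretrained model and then privately training only a small head on those frozen features. The following is a minimal sketch in numpy of that setup, not necessarily the exact pipeline evaluated in the experiments; the dimensions, hyperparameters, and synthetic data are illustrative assumptions, and privacy accounting (converting the noise scale into a concrete $(\epsilon, \delta)$ guarantee) is omitted.

import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_linear_probe(feats, labels, epochs=20, lr=0.5,
                        clip_norm=1.0, noise_multiplier=1.0, batch_size=64):
    """Logistic regression on frozen features, trained with the core DP-SGD
    mechanism: per-example gradient clipping plus Gaussian noise."""
    n, k = feats.shape
    w = np.zeros(k)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            X = feats[start:start + batch_size]
            y = labels[start:start + batch_size]
            # Per-example gradients of the logistic loss.
            p = 1.0 / (1.0 + np.exp(-X @ w))
            grads = (p - y)[:, None] * X
            # Clip each example's gradient to norm at most clip_norm.
            norms = np.linalg.norm(grads, axis=1, keepdims=True)
            grads = grads / np.maximum(1.0, norms / clip_norm)
            # Sum, add noise calibrated to the clipping bound, then average.
            noise = rng.normal(0.0, noise_multiplier * clip_norm, size=k)
            w -= lr * (grads.sum(axis=0) + noise) / len(X)
    return w

# Toy stand-ins for "public" features of private inputs: informative for the
# private task even though the public data alone could not solve it.
k, n = 16, 2048
private_feats = rng.normal(size=(n, k))
private_labels = (private_feats @ rng.normal(size=k) > 0).astype(float)

w_hat = dp_sgd_linear_probe(private_feats, private_labels)
print(((private_feats @ w_hat > 0) == private_labels).mean())

Because only the $k$-dimensional head sees private gradients, the noise added for privacy scales with $k$ rather than with the full model size, which is the mechanism the theory sketch below formalizes.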
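One hedged way to formalize the theoretical claim (an illustrative model invoking standard private stochastic optimization rates, not the exact theorem statement): suppose the public and private tasks have different label functions but depend on the input only through a shared low-dimensional representation,
\[
  y_{\mathrm{pub}} = f_{\mathrm{pub}}\big(\phi(x)\big) + \xi, \qquad
  y_{\mathrm{priv}} = f_{\mathrm{priv}}\big(\phi(x)\big) + \xi', \qquad
  \phi : \mathbb{R}^d \to \mathbb{R}^k, \quad k \ll d,
\]
with $f_{\mathrm{pub}} \neq f_{\mathrm{priv}}$, so the private task cannot be learned from public data alone. If $\phi$ is learned publicly, private training reduces from a $d$-dimensional problem to a $k$-dimensional one, and standard $(\epsilon,\delta)$-DP stochastic convex optimization bounds give excess risk
\[
  \tilde{O}\!\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{k}}{n\epsilon}\right)
  \quad \text{in place of} \quad
  \tilde{O}\!\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\right),
\]
so the privacy-dependent part of the sample complexity needed to reach excess risk $\alpha$ improves from roughly $\tilde{O}(\sqrt{d}/(\alpha\epsilon))$ to $\tilde{O}(\sqrt{k}/(\alpha\epsilon))$.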