Differentially private (DP) machine learning is considered the gold-standard solution for training a model from sensitive data while still preserving privacy. However, a major barrier to achieving this ideal is its sub-optimal privacy-accuracy trade-off, which is particularly visible in DP representation learning. Specifically, it has been shown that under modest privacy budgets, most models learn representations that are not significantly better than hand-crafted features. In this work, we show that effective DP representation learning can be done via image captioning and scaling up to internet-scale multimodal datasets. Through a series of engineering tricks, we successfully train a DP image captioner (DP-Cap) on a 233M subset of LAION-2B from scratch using a reasonable amount of computation, obtaining image features of unprecedented quality that can be used in a variety of downstream vision and vision-language tasks. For example, under a privacy budget of $\varepsilon=8$ for the LAION dataset, a linear classifier trained on top of learned DP-Cap features attains $65.8\%$ accuracy on ImageNet-1K, considerably improving the previous SOTA of $56.5\%$.
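As a rough illustration of the evaluation protocol mentioned above (a linear classifier trained on top of frozen DP-Cap features), the sketch below shows a standard linear probe in PyTorch. The names `frozen_encoder`, `feat_dim`, and the data loaders are hypothetical placeholders under assumed interfaces, not the authors' released code.

```python
# Minimal linear-probe sketch (illustrative only): train a linear classifier
# on frozen image features and report top-1 accuracy, as in the ImageNet-1K
# evaluation described above. All names below are placeholders.
import torch
import torch.nn as nn


def linear_probe(frozen_encoder, train_loader, val_loader,
                 feat_dim=768, num_classes=1000, epochs=10, lr=1e-3,
                 device="cuda"):
    frozen_encoder.eval().to(device)           # encoder weights stay fixed
    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():               # no gradients flow into the encoder
                feats = frozen_encoder(images)
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Top-1 accuracy of the probe on the validation set.
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            feats = frozen_encoder(images.to(device))
            preds = head(feats).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```

Note that the probe itself is trained without DP noise; the privacy guarantee ($\varepsilon=8$) applies to the LAION pre-training data used to learn the frozen features, not to the downstream labeled data.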