Motivated by the recent empirical success of incorporating public data into differentially private learning, we theoretically investigate how a shared representation learned from public data can improve private learning. We explore two common scenarios of transfer learning for linear regression, both of which assume the public and private tasks (regression vectors) share a low-rank subspace in a high-dimensional space. In the first single-task transfer scenario, the goal is to learn a single model shared across all users, each corresponding to a row in a dataset. We provide matching upper and lower bounds showing that our algorithm achieves the optimal excess risk within a natural class of algorithms that search for the linear model within the given subspace estimate. In the second scenario of multitask model personalization, we show that with sufficient public data, users can avoid private coordination, as purely local learning within the given subspace achieves the same utility. Taken together, our results help to characterize the benefits of public data across common regimes of private transfer learning.
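To make the single-task setting concrete, here is a minimal sketch of private linear regression restricted to a subspace estimated from public data. It is not the paper's algorithm. It assumes the public tasks are available as a matrix of regression vectors, estimates the subspace by a truncated SVD, and swaps in a standard technique, Gaussian noise added to the projected sufficient statistics, for the private step. The function names, the `clip` and `ridge` parameters, and the noise calibration (the classical Gaussian mechanism, stated for epsilon < 1 under add/remove-one-row adjacency) are all illustrative assumptions.

```python
import numpy as np


def estimate_subspace(public_thetas, k):
    """Return a d x k orthonormal basis from a d x T matrix of public regression vectors.

    Assumption: the public tasks are summarized by their regression vectors.
    """
    U, _, _ = np.linalg.svd(public_thetas, full_matrices=False)
    return U[:, :k]


def private_regression_in_subspace(X, y, B, epsilon, delta,
                                   clip=1.0, ridge=1e-3, rng=None):
    """Hypothetical sketch: noisy least squares in the k-dimensional subspace span(B).

    Uses the Gaussian mechanism on the projected statistics Z^T Z and Z^T y.
    The accounting below is illustrative only and has not been audited.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Project each private row onto the subspace: n x d -> n x k.
    Z = X @ B
    k = Z.shape[1]

    # Clip row norms and labels so that one row has bounded influence.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    Z = Z * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    y = np.clip(y, -clip, clip)

    # Adding or removing one row changes (Z^T Z, Z^T y) by at most
    # sqrt(clip^4 + clip^4) in joint L2 (Frobenius) norm.
    sensitivity = np.sqrt(2.0) * clip ** 2
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Perturb the statistics; symmetrizing the noisy matrix is post-processing.
    N = rng.normal(scale=sigma, size=(k, k))
    A = Z.T @ Z + (N + N.T) / 2.0
    b = Z.T @ y + rng.normal(scale=sigma, size=k)

    # Solve the small regularized k x k system, then map back to d dimensions.
    w = np.linalg.solve(A + ridge * np.eye(k), b)
    return B @ w
```

In this sketch the noise is added to k-dimensional statistics rather than d-dimensional ones, so its scale does not grow with the ambient dimension d. This is the intuition behind searching within the subspace estimate, although the paper's bounds should be consulted for the actual rates.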