We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution $p$ belonging to a class $\mathcal Q$, with the goal of outputting an estimate of $p$ while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class $\mathcal Q$ is connected to the existence of a sample compression scheme for $\mathcal Q$, as well as to an intermediate notion we refer to as list learning. Leveraging this connection, we (1) approximately recover previous results on Gaussians over $\mathbb R^d$; and (2) obtain new ones, including sample complexity upper bounds for arbitrary $k$-mixtures of Gaussians over $\mathbb R^d$, results for agnostic and distribution-shift-resistant learners, as well as closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in $\mathbb R^d$, at least $d$ public samples are necessary for private learnability, which is close to the known upper bound of $d+1$ public samples.