Many algorithms have been developed to estimate probability distributions subject to differential privacy (DP): such an algorithm takes as input independent samples from a distribution and estimates the density function in a way that is insensitive to any one sample. A recent line of work, initiated by Raskhodnikova et al. (NeurIPS '21), explores a weaker objective: a differentially private algorithm that approximates a single sample from the distribution. Raskhodnikova et al. studied the sample complexity of DP \emph{single-sampling}, i.e., the minimum number of samples needed to perform this task. They showed that the sample complexity of DP single-sampling is smaller than the sample complexity of DP learning for certain distribution classes. We define two variants of \emph{multi-sampling}, where the goal is to privately approximate $m>1$ samples. This better models the realistic scenario in which synthetic data is needed for exploratory data analysis. A baseline solution to \emph{multi-sampling} is to invoke a single-sampling algorithm $m$ times on independently drawn datasets of samples. When the data comes from a finite domain, we improve over the baseline by a factor of $m$ in the sample complexity. When the data comes from a Gaussian, Ghazi et al. (NeurIPS '23) show that \emph{single-sampling} can be performed under approximate differential privacy; we show it is possible to \emph{single- and multi-sample Gaussians with known covariance subject to pure DP}. Our solution uses a variant of the Laplace mechanism that is of independent interest. We also give sample complexity lower bounds, one for strong multi-sampling of finite distributions and another for weak multi-sampling of bounded-covariance Gaussians.
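For background only (the variant used in our construction is defined in the body of the paper, not here), recall the classical Laplace mechanism; the notation $f$, $d$, $\Delta f$, and $x \sim x'$ below is introduced solely for this illustration. Given a statistic $f : \mathcal{X}^n \to \mathbb{R}^d$ with $\ell_1$-sensitivity $\Delta f$ over neighboring datasets $x \sim x'$, the mechanism releases
\[
M(x) = f(x) + (Z_1, \ldots, Z_d), \qquad Z_i \overset{\text{i.i.d.}}{\sim} \mathrm{Lap}\!\left(\frac{\Delta f}{\varepsilon}\right), \qquad \Delta f = \max_{x \sim x'} \|f(x) - f(x')\|_1,
\]
which satisfies pure $\varepsilon$-DP; in particular, its privacy guarantee holds without any failure probability $\delta$, which is the regime our Gaussian results target.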