For scalable machine learning on large data sets, subsampling a representative subset is a common approach for efficient model training. This is often achieved through importance sampling, whereby informative data points are sampled more frequently. In this paper, we examine the privacy properties of importance sampling, focusing on an individualized privacy analysis. We find that, in importance sampling, privacy is well aligned with utility but at odds with sample size. Based on this insight, we propose two approaches for constructing sampling distributions: one that optimizes the privacy-efficiency trade-off, and one based on a utility guarantee in the form of coresets. We evaluate both approaches empirically in terms of privacy, efficiency, and accuracy on the differentially private $k$-means problem. We observe that both approaches yield similar outcomes and consistently outperform uniform sampling across a wide range of data sets. Our code is available on GitHub: https://github.com/smair/personalized-privacy-amplification-via-importance-sampling
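To make the subsampling mechanism concrete, the following is a minimal sketch of importance sampling with inverse-probability reweighting. The score function (distance from the data mean) and all parameter values are illustrative stand-ins, not the distributions proposed in the paper; the point is only the general recipe: sample points with non-uniform probabilities, then weight each sampled point by the inverse of its sampling probability so that weighted subsample sums remain unbiased estimates of full-data sums.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in 2D (illustrative only).
n = 1000
X = rng.normal(size=(n, 2))

# A stand-in "informativeness" score per point (hypothetical choice:
# distance from the mean); the paper derives principled distributions.
scores = np.linalg.norm(X - X.mean(axis=0), axis=1) + 1e-9
q = scores / scores.sum()  # sampling distribution over the n points

# Draw a subsample of size m with probabilities q, with replacement.
m = 100
idx = rng.choice(n, size=m, replace=True, p=q)

# Inverse-probability weights: E[sum_j w_j f(x_j)] = sum_i f(x_i).
w = 1.0 / (m * q[idx])

# Weighted subsample sum approximates the full-data sum in expectation.
full_sum = X.sum(axis=0)
estimate = (w[:, None] * X[idx]).sum(axis=0)
```

In the $k$-means setting, the weighted subsample would then be fed to a (privatized) clustering routine in place of the full data set, with the weights entering the clustering objective.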