The need to analyze sensitive data, such as medical or financial records, has created a critical research challenge in recent years. In this paper, we adopt the framework of differential privacy and explore mechanisms for generating an entire synthetic dataset that accurately captures characteristics of the original data. We build upon the work of Boedihardjo et al., which laid the foundations for a new optimization-based algorithm for generating private synthetic data. Importantly, we adapt their algorithm by replacing a uniform sampling step with a private distribution estimator; this allows us to obtain better computational guarantees for discrete distributions and to develop a novel algorithm suitable for continuous distributions. We also explore applications of our work to several statistical tasks.