Existing private synthetic data generation algorithms are agnostic to downstream tasks. However, end users may have specific requirements that the synthetic data must satisfy. Failure to meet these requirements could significantly reduce the utility of the data for downstream use. We introduce a post-processing technique that improves the utility of the synthetic data with respect to measures selected by the end user, while preserving strong privacy guarantees and dataset quality. Our technique involves resampling from the synthetic data to filter out samples that do not meet the selected utility measures, using an efficient stochastic first-order algorithm to find optimal resampling weights. Through comprehensive numerical experiments, we demonstrate that our approach consistently improves the utility of synthetic data across multiple benchmark datasets and state-of-the-art synthetic data generation algorithms.
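As a rough illustration of the resampling idea, the sketch below reweights synthetic rows so that the weighted means of a few user-selected measures match target values, then resamples according to those weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the exponential-tilting form of the weights, the decaying-step minibatch gradient update, and the names `fit_resampling_weights`, `Q`, and `targets` are all hypothetical choices made for illustration. The targets are assumed to be values already released under differential privacy, so the reweighting is pure post-processing.

```python
import numpy as np

def fit_resampling_weights(Q, targets, steps=2000, lr=0.5, batch_size=256, seed=0):
    """Find weights over synthetic rows whose weighted measure means match targets.

    Q       : (n, k) array, Q[i, j] = value of measure j on synthetic row i (hypothetical input)
    targets : (k,) array of desired means, assumed to be privately released
    Returns : (n,) array of nonnegative weights summing to 1
    """
    rng = np.random.default_rng(seed)
    n, k = Q.shape
    lam = np.zeros(k)  # dual variables, one per selected measure
    for t in range(steps):
        # Minibatch of synthetic rows for a stochastic gradient estimate.
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        Qb = Q[idx]
        logits = -Qb @ lam
        logits -= logits.max()  # shift for numerical stability
        wb = np.exp(logits)
        wb /= wb.sum()
        # Gradient of the dual objective log-mean-exp(-Q lam) + lam . targets
        grad = targets - wb @ Qb
        lam -= lr / np.sqrt(t + 1.0) * grad  # decaying step size
    # Final weights over all rows (exponential tilting of the uniform distribution).
    logits = -Q @ lam
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Toy usage: one binary measure that is true for 30% of synthetic rows,
# while the (assumed, privately released) target frequency is 50%.
n = 1000
Q = np.zeros((n, 1))
Q[:300, 0] = 1.0
targets = np.array([0.5])

w = fit_resampling_weights(Q, targets)
weighted_mean = float(w @ Q[:, 0])

# Resample row indices according to the fitted weights.
rng = np.random.default_rng(1)
resampled_idx = rng.choice(n, size=n, replace=True, p=w)
```

Because the weights are an exponential tilt of the uniform distribution, rows are down-weighted rather than discarded outright, which is one way to stay close to the original synthetic data while correcting the selected measures.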