Synthetic data have gained increasing attention across various domains, with a growing emphasis on their performance in downstream prediction tasks. However, most existing synthesis strategies focus on preserving statistical information. Although some studies address prediction performance guarantees, their single-stage synthesis designs make it difficult to balance privacy requirements, which call for significant perturbations, against prediction performance, which is sensitive to such perturbations. We propose a two-stage synthesis strategy. In the first stage, we introduce a synthesis-then-hybrid strategy: a synthesis operation generates pure synthetic data, and a hybrid operation then fuses the synthetic data with the original data. In the second stage, we present a kernel ridge regression (KRR)-based synthesis strategy, in which a KRR model is first trained on the original data and then used to generate synthetic outputs from the synthetic inputs produced in the first stage. By leveraging the theoretical strengths of KRR and the covariant distribution retention achieved in the first stage, the proposed two-stage strategy enables a statistics-driven, restricted privacy--prediction trade-off and guarantees optimal prediction performance. We validate the approach both theoretically and numerically, showing that its privacy--prediction trade-off is statistics-driven and restricted. Additionally, we showcase its generalizability through applications to a marketing problem and five real-world datasets.
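The two-stage pipeline described above can be illustrated with a minimal sketch. The abstract does not specify the synthesis or hybrid operations, so everything concrete here is an assumption made for illustration: the synthesis step samples from a Gaussian fitted to the original inputs, the hybrid step is a convex combination controlled by a hypothetical weight `mix`, and the KRR model uses a Gaussian kernel with hypothetical parameters `lam` and `bandwidth`. It is not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def stage_one_inputs(X, mix=0.5, seed=None):
    """Stage 1 (assumed form): synthesize inputs, then fuse them with the originals.

    Synthesis: draw from a Gaussian fitted to X (an illustrative choice).
    Hybrid: convex combination; mix=1 is pure synthetic, mix=0 returns X.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-8 * np.eye(d)
    X_syn = rng.multivariate_normal(mean, cov, size=n)
    return mix * X_syn + (1.0 - mix) * X


def fit_krr(X, y, lam=1e-3, bandwidth=1.0):
    """Fit kernel ridge regression on the original data; return a predictor."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, bandwidth)
    # Closed-form KRR coefficients: (K + n * lam * I)^{-1} y
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda X_new: gaussian_kernel(X_new, X, bandwidth) @ coef


def two_stage_synthesis(X, y, mix=0.5, lam=1e-3, bandwidth=1.0, seed=0):
    """Stage 1 produces synthetic inputs; stage 2 labels them with KRR."""
    X_tilde = stage_one_inputs(X, mix=mix, seed=seed)
    predict = fit_krr(X, y, lam=lam, bandwidth=bandwidth)
    y_tilde = predict(X_tilde)
    return X_tilde, y_tilde


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.uniform(-1.0, 1.0, size=(200, 2))
    y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)
    X_tilde, y_tilde = two_stage_synthesis(X, y, mix=0.7)
    print(X_tilde.shape, y_tilde.shape)
```

In this sketch the synthetic outputs come from the fitted KRR model rather than from perturbing the original outputs, which mirrors the abstract's point that the privacy-driven perturbation is applied to the inputs in stage one while prediction quality is handled separately in stage two.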