Synthetic data generation is increasingly used in applications involving privacy preservation, data sharing, and data scarcity. In many situations, preserving the dependence structure of the original data is of central interest. In this work, we propose a lightweight postprocessing methodology for synthetic tabular data based on the Orthogonal Procrustes problem. Starting from an already generated synthetic dataset, our approach constructs the closest dataset to it that restores the Pearson correlation structure of the original data. On the theoretical side, we show that preserving Pearson correlation is equivalent to applying a linear orthogonal map within the centered-data subspace, and we then solve the resulting Orthogonal Procrustes problem. For this to be valid, we first establish that, under suitable assumptions, the output of the Orthogonal Procrustes step remains in this subspace. Applications to several datasets and synthetic data generators illustrate the effectiveness of the proposed approach. In particular, the numerical experiments indicate that the correlation structure can be restored while largely preserving the individual feature distributions, the geometry of the data, and the performance of downstream classification tasks.
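The postprocessing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `standardize` and `restore_correlation` are hypothetical, and the formulation is an assumption, namely an orthonormal-column Procrustes problem solved with a thin SVD. The sketch also assumes that the original correlation matrix is positive definite and that the synthetic data has full column rank, which is what keeps the result centered.

```python
import numpy as np


def standardize(data):
    """Center each column and scale it to unit Euclidean norm."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    scaled = (data - mean) / (std * np.sqrt(data.shape[0]))
    return scaled, mean, std


def restore_correlation(original, synthetic):
    """Return the dataset closest to `synthetic` (Frobenius norm, in
    standardized coordinates) whose Pearson correlation matrix equals
    that of `original`. Hypothetical helper, assuming full-rank inputs."""
    n_rows = synthetic.shape[0]
    target_corr = np.corrcoef(original, rowvar=False)
    synth_scaled, mean, std = standardize(synthetic)

    # Symmetric square root A of the target correlation matrix (A @ A = R).
    eigvals, eigvecs = np.linalg.eigh(target_corr)
    root = (eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))) @ eigvecs.T

    # Procrustes step: find W with orthonormal columns minimizing
    # ||W @ A - synth_scaled||_F. The minimizer is U @ Vt, taken from
    # the thin SVD of synth_scaled @ A.
    u, _, vt = np.linalg.svd(synth_scaled @ root, full_matrices=False)
    corrected_scaled = (u @ vt) @ root  # Gram matrix equals target_corr

    # Undo the standardization using the synthetic means and scales.
    return corrected_scaled * (std * np.sqrt(n_rows)) + mean


# Illustrative data only: a correlated "original" and an uncorrelated "synthetic".
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.2], [0.0, 0.6, 0.5], [0.0, 0.0, 0.7]])
original = rng.normal(size=(500, 3)) @ mixing
synthetic = rng.normal(size=(400, 3))
corrected = restore_correlation(original, synthetic)
```

In this sketch the corrected dataset keeps the per-column means and standard deviations of the synthetic data, and only its correlation matrix is replaced by that of the original. Changes to the full marginal distributions are not controlled here and would need to be checked empirically.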