Recent advances in generative models facilitate the creation of synthetic data that can be shared for research in privacy-sensitive contexts. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, whereby synthetic data are treated as if they were actually observed. By means of a simulation study, we show that under naive inference the rate of false-positive findings (type I error) is unacceptably high, even when the estimates are unbiased. Despite the use of a previously proposed correction factor, this problem persists for deep generative models, in part due to slower convergence of estimators and the resulting underestimation of the true standard error. We further demonstrate our findings through a case study. Before publishing synthetic data, it is therefore essential to develop statistical inference tools for such data.
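The mechanism behind the inflated type I error can be illustrated with a minimal sketch (not the authors' actual simulation study, and using a simple Gaussian generator rather than a deep generative model): synthetic observations carry both the generator's sampling noise and the estimation noise of the fitted parameters, but a naive test treats them as a single i.i.d. sample, so its standard error is too small and the null is rejected far more often than the nominal level.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, alpha = 100, 2000, 0.05  # sample size, replications, nominal level
rejections = 0
for _ in range(reps):
    real = rng.normal(0.0, 1.0, n)             # true mean is 0, so H0 holds
    mu_hat, sd_hat = real.mean(), real.std(ddof=1)
    synthetic = rng.normal(mu_hat, sd_hat, n)  # draw from the fitted generator
    # naive one-sample t-test of H0: mean = 0, treating synthetic data as observed
    t = synthetic.mean() / (synthetic.std(ddof=1) / np.sqrt(n))
    if abs(t) > 1.984:                         # two-sided critical value, df = 99
        rejections += 1
rate = rejections / reps
print(f"empirical type I error: {rate:.3f} (nominal {alpha})")
```

Because the synthetic sample mean has roughly twice the variance the naive test assumes, the empirical rejection rate lands well above the nominal 5% level, in line with the abstract's warning.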