The growing interest in data sharing makes synthetic data appealing. However, the analysis of synthetic data raises a unique set of methodological challenges. In this work, we highlight the importance of inferential utility and provide empirical evidence against naive inference from synthetic data, that is, inference that treats synthetic records as if they had actually been observed. We argue that the rate of false-positive findings (type I error) will be unacceptably high, even when the estimates are unbiased. One reason is that the naive standard error underestimates the true standard error, and this underestimation may even worsen progressively with larger sample sizes because of slower convergence. This is especially problematic for deep generative models. Before synthetic data are published, it is essential to develop statistical inference tools for such data.
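The claim that naive inference inflates the type I error even for unbiased estimates can be illustrated with a small simulation. The sketch below is not the setup used in this work: as a simplifying assumption, it replaces the deep generative model with a Gaussian fitted to the original data, and the function name `naive_rejection_rate` and its parameters are hypothetical. In this toy case the synthetic mean is unbiased, but it carries noise from both the original sample and the synthetic sampling step, so its variance is roughly twice what the naive standard error assumes. The toy model does not reproduce the worsening with sample size described for deep generative models.

```python
import math
import random
import statistics


def naive_rejection_rate(n=100, reps=2000, seed=0):
    """Share of naive z-tests of H0: mean = 0 that reject at the 5% level."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Original data: the null hypothesis is true (mean 0, sd 1).
        original = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Toy "generative model" (assumption): a Gaussian fitted to the original data.
        mu_hat = statistics.fmean(original)
        sd_hat = statistics.stdev(original)
        # Synthetic data of the same size, sampled from the fitted model.
        synthetic = [rng.gauss(mu_hat, sd_hat) for _ in range(n)]
        # Naive inference: treat the synthetic data as if really observed.
        estimate = statistics.fmean(synthetic)
        naive_se = statistics.stdev(synthetic) / math.sqrt(n)
        if abs(estimate / naive_se) > 1.96:
            rejections += 1
    return rejections / reps


rate = naive_rejection_rate()
print(f"naive type I error rate: {rate:.3f} (nominal 0.05)")
```

Under these assumptions the printed rejection rate should sit well above the nominal 5% level, although the null hypothesis holds in every replication.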