Synthetic data offers a promising tool for privacy-preserving data release, augmentation, and simulation, but its use in causal inference requires preserving more than predictive fidelity. We show that fully generative tabular synthesizers, including GAN- and LLM-based models, can achieve strong train-on-synthetic, test-on-real (TSTR) performance while substantially distorting causal estimands such as the average treatment effect (ATE). We formalize this failure through sensitivity and tradeoff results showing that ATE preservation requires control of both the generated covariate law and the treatment-effect contrast in the outcome regression. Motivated by this observation, we propose a hybrid synthetic-data framework that generates covariates separately from the treatment and outcome mechanisms, using distance-to-closest-record (DCR) diagnostics to monitor covariate synthesis and separately learned nuisance models to construct covariate-treatment-outcome triplets (W, A, Y). We further study targeted synthetic augmentation for practical positivity problems and characterize when added overlap support helps: it does so when it improves conditional-effect estimation more than it shifts the covariate distribution. Finally, we develop a synthetic simulation engine for pre-analysis estimator evaluation, enabling finite-sample comparison of outcome regression (OR), inverse probability weighting (IPW), augmented IPW (AIPW), and targeted maximum likelihood estimation (TMLE) under realistic covariate structure. Across experiments, hybrid synthetic data substantially improve ATE preservation relative to fully generative baselines and provide a practical diagnostic tool for robust causal analysis.
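To make the hybrid construction concrete, the following is a minimal sketch under stated assumptions, not the implementation used in the paper. The covariate step is a placeholder (bootstrap rows plus small Gaussian jitter) standing in for any tabular synthesizer; the nuisance models are a plain logistic propensity model and a linear outcome regression with treatment-covariate interactions; and all function names (`hybrid_synthesize`, `dcr`, `plug_in_ate`, and so on) as well as the toy data-generating process are hypothetical choices made for illustration.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_propensity(W, A, n_iter=25):
    """Logistic regression of A on W via Newton's method (assumed nuisance model)."""
    X = np.column_stack([np.ones(len(W)), W])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ beta)
        grad = X.T @ (A - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def design(A, W):
    """Outcome-regression features: intercept, A, W, and A*W interactions."""
    A = np.asarray(A, dtype=float)
    return np.column_stack([np.ones(len(W)), A, W, A[:, None] * W])

def fit_outcome(W, A, Y):
    """Least-squares fit of the outcome regression Q(A, W)."""
    coef, *_ = np.linalg.lstsq(design(A, W), Y, rcond=None)
    return coef

def plug_in_ate(W, coef):
    """Outcome-regression (plug-in) ATE: mean of Q(1, W) - Q(0, W)."""
    n = len(W)
    q1 = design(np.ones(n), W) @ coef
    q0 = design(np.zeros(n), W) @ coef
    return float(np.mean(q1 - q0))

def dcr(W_syn, W_real):
    """Distance to closest real record (Euclidean) for each synthetic row."""
    d2 = ((W_syn ** 2).sum(axis=1)[:, None]
          + (W_real ** 2).sum(axis=1)[None, :]
          - 2.0 * W_syn @ W_real.T)
    return np.sqrt(np.maximum(d2, 0.0).min(axis=1))

def hybrid_synthesize(W, A, Y, rng, jitter=0.05):
    """Generate covariates first, then draw A and Y from separately fitted models."""
    n = len(W)
    # Placeholder covariate synthesizer (assumption): bootstrap plus jitter.
    W_syn = W[rng.integers(0, n, size=n)] + jitter * rng.standard_normal(W.shape)
    beta = fit_propensity(W, A)
    coef = fit_outcome(W, A, Y)
    resid = Y - design(A, W) @ coef
    # Treatment drawn from the fitted propensity model.
    X_syn = np.column_stack([np.ones(n), W_syn])
    A_syn = rng.binomial(1, expit(X_syn @ beta)).astype(float)
    # Outcome drawn from the fitted regression plus resampled residuals.
    Y_syn = design(A_syn, W_syn) @ coef + rng.choice(resid, size=n)
    return W_syn, A_syn, Y_syn

# Toy "real" data (hypothetical data-generating process, true ATE = 2).
rng = np.random.default_rng(0)
n = 1000
W = rng.standard_normal((n, 2))
A = rng.binomial(1, expit(0.5 * W[:, 0])).astype(float)
Y = 1 + 2 * A + W[:, 0] + 0.5 * A * W[:, 1] + rng.standard_normal(n)

W_s, A_s, Y_s = hybrid_synthesize(W, A, Y, rng)
ate_real = plug_in_ate(W, fit_outcome(W, A, Y))
ate_syn = plug_in_ate(W_s, fit_outcome(W_s, A_s, Y_s))
print("ATE (real):", ate_real)
print("ATE (hybrid synthetic):", ate_syn)
print("median DCR:", float(np.median(dcr(W_s, W))))
```

Because the treatment and outcome are drawn from models fitted on the real data rather than from a joint generator, the treatment-effect contrast in the synthetic sample is inherited directly from the fitted outcome regression, and the DCR summary monitors only the covariate step. In practice the placeholder synthesizer and the parametric nuisance models would be replaced by whatever generator and learners suit the application.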