The generation of synthetic clinical trial data offers a promising approach to mitigating privacy concerns and data accessibility limitations in medical research. However, ensuring that synthetic datasets maintain high fidelity, utility, and adherence to domain-specific constraints remains a key challenge. While hyperparameter optimization (HPO) improves generative model performance, the effectiveness of different optimization strategies for synthetic clinical data remains unclear. This study systematically evaluates four HPO objectives across nine generative models, comparing single-metric with compound-metric optimization. Our results demonstrate that HPO consistently improves synthetic data quality, with Tab DDPM achieving the largest relative gains, followed by TVAE (60%), CTGAN (39%), and CTAB-GAN+ (38%). Compound-metric optimization outperforms single-metric objectives, producing more generalizable synthetic datasets. Although it improves overall quality, HPO alone fails to prevent violations of essential clinical survival constraints. Preprocessing and postprocessing play a crucial role in reducing these violations: models lacking robust processing steps produce invalid data in up to 61% of cases. These findings underscore the necessity of integrating explicit domain knowledge alongside HPO to generate high-quality synthetic datasets. Our study provides actionable recommendations for improving synthetic data generation; future work is needed to refine metric selection and validate the findings on larger datasets.
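The role of postprocessing in catching constraint violations can be illustrated with a minimal sketch. The abstract does not state which survival constraints were checked, so the rule used here (progression-free survival must not exceed overall survival, and both times must be non-negative) is an assumption, as are the field names `pfs_months` and `os_months` and all function names.

```python
from typing import Dict, List

Record = Dict[str, float]


def violates_survival_constraint(row: Record) -> bool:
    """Return True if a row breaks the assumed rule 0 <= PFS <= OS.

    The rule and the field names are illustrative assumptions,
    not the constraints used in the study.
    """
    pfs, os_time = row["pfs_months"], row["os_months"]
    return pfs < 0 or os_time < 0 or pfs > os_time


def violation_rate(rows: List[Record]) -> float:
    """Fraction of synthetic rows that violate the assumed constraint."""
    if not rows:
        return 0.0
    return sum(violates_survival_constraint(r) for r in rows) / len(rows)


def drop_violations(rows: List[Record]) -> List[Record]:
    """Postprocessing step: keep only rows that satisfy the constraint."""
    return [r for r in rows if not violates_survival_constraint(r)]


# Hypothetical synthetic records, made up for this example.
synthetic = [
    {"pfs_months": 6.0, "os_months": 14.0},
    {"pfs_months": 9.5, "os_months": 7.0},   # invalid: PFS exceeds OS
    {"pfs_months": 3.0, "os_months": 3.0},
    {"pfs_months": -1.0, "os_months": 5.0},  # invalid: negative time
]

print(violation_rate(synthetic))        # 0.5
print(len(drop_violations(synthetic)))  # 2
```

A filter of this kind only removes invalid rows after generation. It does not change what the model learns, which is consistent with the finding that HPO alone does not prevent such violations.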


