SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use $2$-$6\times$ more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC $\approx 0.50$), preserved linkage disequilibrium structure, and high allele frequency correlation ($r \geq 0.95$) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.
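
As an illustration of two of the fidelity and privacy checks named above (allele frequency correlation and identical-match counting), the following is a minimal sketch, not the authors' evaluation code. It assumes genotypes are stored as a samples-by-SNPs matrix of allele dosages coded 0/1/2. The function names and the toy data are hypothetical, and the "synthetic" matrix here is simply a second random draw from the same allele frequencies rather than output from a diffusion model.

```python
import numpy as np


def allele_frequencies(genotypes):
    """Per-SNP allele frequency for a (samples x SNPs) dosage matrix coded 0/1/2."""
    return genotypes.mean(axis=0) / 2.0


def allele_frequency_correlation(real, synthetic):
    """Pearson r between per-SNP allele frequencies of the two datasets."""
    r = np.corrcoef(allele_frequencies(real), allele_frequencies(synthetic))[0, 1]
    return float(r)


def count_identical_matches(real, synthetic):
    """Number of synthetic rows that exactly duplicate some real row."""
    # Cast to a common dtype so that byte-level comparison is meaningful.
    real = np.ascontiguousarray(real, dtype=np.int8)
    synthetic = np.ascontiguousarray(synthetic, dtype=np.int8)
    real_rows = {row.tobytes() for row in real}
    return sum(row.tobytes() in real_rows for row in synthetic)


# Toy data (assumption): 2,000 individuals, 64 SNPs, independent sites.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=64)
real = rng.binomial(2, freqs, size=(2000, 64)).astype(np.int8)
synthetic = rng.binomial(2, freqs, size=(2000, 64)).astype(np.int8)

af_r = allele_frequency_correlation(real, synthetic)
n_copies = count_identical_matches(real, synthetic)
print(f"allele frequency r = {af_r:.3f}, identical matches = {n_copies}")
```

On real data the same two functions would be applied to the held-out source genotypes and the generated samples. Exact-match counting is only the coarsest privacy check, which is why the abstract pairs it with membership inference.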