In this paper, we introduce the first diffusion model designed to generate complete synthetic human genotypes, which can be expanded into full-length, DNA-level genomes using standard protocols. Measured by established metrics, the synthetic genotypes mimic real human genotypes without merely reproducing known ones. Classifiers for biomedically relevant tasks trained on synthetic genotypes achieve accuracy near-identical to that of classifiers trained on real data. We further demonstrate that augmenting small amounts of real data with synthetically generated genotypes drastically improves performance. This addresses a significant challenge in translational human genetics: real human genotypes, although emerging in large volumes from genome-wide association studies, are sensitive private data, which limits their public availability. Integrating additional, non-sensitive data therefore appears imperative when striving for rapid sharing of biomedical knowledge of public interest.