Genome data are crucial in modern medicine, offering significant potential for diagnosis and treatment. Thanks to technological advancements, many millions of healthy and diseased genomes have already been sequenced; however, obtaining the most suitable data for a specific study, particularly a validation study, remains challenging in terms of scale and access. In silico genomics sequence generators have therefore been proposed as a possible solution. However, current generators produce inferior data because they rely mostly on shallow (stochastic) connections detected in the training data with limited computational complexity. As a result, they do not take into consideration the biological relations and constraints that originally caused the observed connections. To address this issue, we propose the cancer-inspired genomics mapper model (CGMM), which combines genetic algorithm (GA) and deep learning (DL) methods. CGMM mimics the processes that generate genetic variations and mutations to transform readily available control genomes into genomes with the desired phenotypes. We demonstrate that CGMM can generate synthetic genomes of selected phenotypes, such as ancestry and cancer, that are indistinguishable from real genomes of those phenotypes under unsupervised clustering. Our results show that CGMM outperforms four current state-of-the-art genomics generators on two different tasks, suggesting that CGMM will be suitable for a wide range of purposes in genomic medicine, especially for much-needed validation studies.