We extend biologically informed neural networks (BINNs) for genomic prediction (GP) and genomic selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models rely on direct mappings that achieve only modest accuracy, forcing breeders to run large, costly field trials to maintain or marginally improve genetic gain. Models that incorporate intermediate molecular phenotypes such as gene expression can achieve a better predictive fit, but they remain impractical for GS because such data are unavailable at deployment or design time. BINNs overcome this limitation by encoding pathway-level inductive biases and using multi-omics data only during training, while requiring genotype data alone at inference. Applied to maize gene-expression and multi-environment field-trial data, BINNs improve rank-correlation accuracy by up to 56% within and across subpopulations under sparse-data conditions, and they identify, through nonlinear relationships, genes that genome-wide association studies (GWAS) and transcriptome-wide association studies (TWAS) fail to uncover. Given complete domain knowledge on a synthetic metabolomics benchmark, BINNs reduce prediction error by 75% relative to conventional neural networks and correctly identify the most important nonlinear pathway. Importantly, in both cases the highly sensitive BINN latent variables correlate with the experimental quantities they represent, even though the models were not trained on those quantities. This suggests that BINNs learn biologically relevant representations, whether linear or nonlinear, along the path from genotype to phenotype. Together, these results establish BINNs as a framework that leverages intermediate domain information to improve genomic prediction accuracy and to reveal nonlinear biological relationships that can guide genomic selection, candidate gene selection, pathway enrichment, and gene-editing prioritization.
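To make the idea of a pathway-level inductive bias concrete, the following is a minimal NumPy sketch, not the architecture used in this work. It assumes a hypothetical SNP-to-gene assignment (`mask`), a single tanh layer of gene-level latent variables, and synthetic stand-in data; all names, sizes, and hyperparameters are illustrative assumptions. How multi-omics data enter training is not specified here, so the sketch trains on the phenotype alone and uses simulated "expression" only to generate that phenotype. The point it illustrates is that prediction needs nothing but genotypes, and that connections outside the prior mask cannot influence the output.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_snps, n_genes = 200, 12, 3  # illustrative sizes only

# Hypothetical prior knowledge: SNP j may feed gene g only where mask[j, g] == 1.
mask = np.zeros((n_snps, n_genes))
for g in range(n_genes):
    mask[4 * g:4 * (g + 1), g] = 1.0

# Synthetic stand-in data (not from the paper).
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)  # 0/1/2 genotype codes
X -= X.mean(axis=0)                                             # center each SNP
E = np.tanh(X @ (rng.normal(size=(n_snps, n_genes)) * mask))    # simulated expression
y = E @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n_samples)


def forward(X, W1, w2):
    """Genotype -> masked gene-level latents -> phenotype prediction."""
    H = np.tanh(X @ (W1 * mask))  # weights outside the prior mask are zeroed
    return H, H @ w2


def train(X, y, lr=0.05, steps=300):
    """Full-batch gradient descent on the phenotype mean squared error."""
    W1 = 0.1 * rng.normal(size=(n_snps, n_genes))
    w2 = np.zeros(n_genes)
    losses = []
    for _ in range(steps):
        H, y_hat = forward(X, W1, w2)
        losses.append(np.mean((y_hat - y) ** 2))
        d_y = 2.0 * (y_hat - y) / len(y)      # dLoss / dy_hat
        d_H = np.outer(d_y, w2)               # backprop into the latents
        d_Z = d_H * (1.0 - H ** 2)            # through the tanh nonlinearity
        w2 -= lr * (H.T @ d_y)
        W1 -= lr * (X.T @ d_Z) * mask         # only masked-in weights are updated
    return W1, w2, losses


W1, w2, losses = train(X, y)

# Inference uses genotypes only; no expression data are required.
X_new = X[:5]
_, y_pred = forward(X_new, W1, w2)
```

After training, the columns of `H` play the role of gene-level latent variables; in a setting with real expression measurements, one could compare them against those measurements after the fact, which is the kind of check the abstract describes.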