This manuscript addresses the intersection of genomics and phenotypic prediction, focusing on the statistical methodology needed to handle noisy covariates and confounders. The emphasis is on developing robust statistical models for genomic prediction from single nucleotide polymorphism (SNP) data collected in genome-wide association studies (GWAS) for plant and animal breeding and in multi-field trials. The manuscript examines the limitations of traditional marker-assisted recurrent selection, highlights the importance of incorporating all estimated marker-locus effects into the statistical framework, and aims to reduce the high dimensionality of GWAS data while preserving critical information. It introduces a new robust statistical framework for genomic prediction that uses one-stage and two-stage linear mixed model analyses together with the robust minimum density power divergence estimator (MDPDE) to estimate genetic effects on phenotypic traits. Extensive empirical experiments on artificial datasets and an application to a real maize breeding dataset show that the proposed MDPDE-based genomic prediction and the associated heritability estimation procedures outperform existing competitors. The results demonstrate the robustness and accuracy of the proposed MDPDE-based approaches, especially in the presence of data contamination, and point to their potential for improving breeding programs and advancing genomic prediction of phenotypic traits.
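To make the robustness claim concrete, the following is a minimal sketch of the MDPDE in a deliberately simple setting: estimating the mean and standard deviation of a univariate normal sample, not the linear mixed models used in the manuscript. For a tuning parameter alpha > 0, the MDPDE minimizes the empirical density power divergence objective H_n(theta) = ∫ f_theta^(1+alpha) dx - (1 + 1/alpha) (1/n) Σ f_theta(x_i)^alpha. For the normal model, setting the gradient to zero gives weighted-moment equations with weights w_i = exp(-alpha (x_i - mu)^2 / (2 sigma^2)), which the sketch solves by fixed-point iteration. The function name `mdpde_normal`, the choice alpha = 0.5, the median/MAD starting values, and the toy contamination scheme are all assumptions made for illustration.

```python
import math
import random
import statistics


def mdpde_normal(x, alpha=0.5, tol=1e-10, max_iter=500):
    """Toy MDPDE of (mu, sigma) under a N(mu, sigma^2) model via fixed-point iteration.

    Illustrative sketch only; this is not the manuscript's mixed-model procedure.
    """
    n = len(x)
    # Robust starting values: sample median and scaled median absolute deviation.
    mu = statistics.median(x)
    mad = statistics.median([abs(v - mu) for v in x])
    sigma2 = (1.4826 * mad) ** 2
    # Correction term arising from the integral of f^(1+alpha) in the objective.
    c = n * alpha * (1.0 + alpha) ** -1.5
    for _ in range(max_iter):
        # Observations far from the current fit receive exponentially small weight.
        w = [math.exp(-alpha * (v - mu) ** 2 / (2.0 * sigma2)) for v in x]
        sw = sum(w)
        new_mu = sum(wi * v for wi, v in zip(w, x)) / sw
        new_sigma2 = sum(wi * (v - new_mu) ** 2 for wi, v in zip(w, x)) / (sw - c)
        converged = abs(new_mu - mu) < tol and abs(new_sigma2 - sigma2) < tol
        mu, sigma2 = new_mu, new_sigma2
        if converged:
            break
    return mu, math.sqrt(sigma2)


# Toy contaminated sample: 450 draws from N(0, 1) plus 50 outliers from N(10, 1).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(450)]
data += [rng.gauss(10.0, 1.0) for _ in range(50)]

mu_hat, sigma_hat = mdpde_normal(data, alpha=0.5)
print("MDPDE:          mu = %.3f, sigma = %.3f" % (mu_hat, sigma_hat))
print("Mean / std dev: mu = %.3f, sigma = %.3f"
      % (statistics.fmean(data), statistics.pstdev(data)))
```

In this toy setting the outliers receive weights close to zero, so the MDPDE stays near the parameters of the uncontaminated component while the sample mean and standard deviation are pulled toward the outliers. Larger alpha increases robustness at some cost in efficiency under clean data; as alpha approaches zero the estimator approaches the maximum likelihood estimator.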