Modern genomics research relies on genome-wide association studies (GWAS) to identify the few genetic variants among potentially millions that are associated with diseases of interest. Only reproducible discoveries of groups of associations improve our understanding of complex polygenic diseases and enable the development of new drugs and personalized medicine. Thus, fast multivariate variable selection methods that have a high true positive rate (TPR) while controlling the false discovery rate (FDR) are crucial. Recently, the T-Rex+GVS selector, a version of the T-Rex selector that uses the elastic net (EN) as a base selector to perform grouped variable selection, was proposed. Although it significantly increased the TPR in simulated GWAS compared to the original T-Rex, its comparatively high computational cost limits scalability. Therefore, we propose the informed elastic net (IEN), a new base selector that significantly reduces computation time while retaining the grouped variable selection property. We quantify its grouping effect and derive its formulation as a Lasso-type optimization problem, which is solved efficiently within the T-Rex framework by the terminated LARS algorithm. Numerical simulations and a GWAS demonstrate that the proposed T-Rex+GVS (IEN) exhibits the desired grouping effect, reduces computation time, and achieves the same TPR as T-Rex+GVS (EN) but with lower FDR, which makes it a promising method for large-scale GWAS.
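To make the two performance measures concrete, the following minimal sketch computes the false discovery proportion (FDP) and true positive proportion (TPP) of a single selection and averages them over repeated runs to estimate FDR and TPR, using the standard definitions FDR = E[FDP] and TPR = E[TPP]. The function names and the toy variant indices are hypothetical and serve only as an illustration; they are not part of the T-Rex software.

```python
def fdp_and_tpp(selected, true_active):
    """Return (FDP, TPP) for one selected set of variant indices.

    FDP = false positives / max(1, number of selected variants)
    TPP = true positives  / max(1, number of truly active variants)
    """
    selected, true_active = set(selected), set(true_active)
    true_pos = len(selected & true_active)
    false_pos = len(selected - true_active)
    fdp = false_pos / max(1, len(selected))
    tpp = true_pos / max(1, len(true_active))
    return fdp, tpp


def estimate_fdr_tpr(runs):
    """Average FDP and TPP over runs, given (selected, true_active) pairs."""
    pairs = [fdp_and_tpp(sel, act) for sel, act in runs]
    n = len(pairs)
    fdr = sum(p[0] for p in pairs) / n
    tpr = sum(p[1] for p in pairs) / n
    return fdr, tpr


if __name__ == "__main__":
    # Hypothetical toy data: variants 1..5 are truly associated.
    fdp, tpp = fdp_and_tpp(selected=[1, 2, 3, 7], true_active=[1, 2, 3, 4, 5])
    print(f"FDP = {fdp:.2f}, TPP = {tpp:.2f}")
```

In a simulation study, `estimate_fdr_tpr` would be fed one `(selected, true_active)` pair per Monte Carlo replication, so that the averages approximate the expectations that define FDR and TPR.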