Conditional testing via the knockoff framework allows one to identify, among a large number of possible explanatory variables, those that carry unique information about an outcome of interest, and it provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), whose goal is to identify genetic variants that influence traits of medical relevance. While conditional testing can be both more powerful and more precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct "group knockoffs." While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank. The described algorithms are implemented in the open-source Julia package Knockoffs.jl, for which both R and Python wrappers are available.