A major challenge in imaging genetics and similar fields is to link high-dimensional data in one domain, e.g., genetic data, to high-dimensional data in a second domain, e.g., brain imaging data. The standard approach in the field is mass univariate analysis across genetic factors and imaging phenotypes, which entails running one genome-wide association study (GWAS) for each pre-defined imaging measure. Although this approach has been tremendously successful, one shortcoming is that phenotypes must be pre-defined. Consequently, effects that are not confined to pre-selected regions of interest, or that reflect larger brain-wide patterns, can easily be missed. In this work we introduce a Partial Least Squares (PLS)-based framework, which we term Cluster-Bootstrap PLS (CLUB-PLS), that can work with large input dimensions in both domains as well as with large sample sizes. A key component of the framework is the use of cluster bootstrap to provide robust statistics for single input features in both domains. We applied CLUB-PLS to investigate the genetic basis of surface area and cortical thickness in a sample of 33,000 subjects from the UK Biobank. We found 107 genome-wide significant locus-phenotype pairs that are linked to 386 different genes. The vast majority of these loci could be technically validated: using classic GWAS or Genome-Wide Inferred Statistics (GWIS), 85 locus-phenotype pairs exceeded the genome-wide suggestive threshold (P<1e-05).
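To make the idea of bootstrap-based per-feature statistics in a two-domain PLS concrete, the following is a minimal sketch and not the CLUB-PLS procedure itself. It assumes a plain PLS-SVD on the cross-covariance matrix, an ordinary resampling of subjects with replacement, and a simple sign alignment of each bootstrap solution to a reference solution; the clustering step of CLUB-PLS is not described in this abstract and is omitted. The function names `first_pls_weights` and `bootstrap_ratios` are hypothetical.

```python
import numpy as np


def first_pls_weights(X, Y):
    """Return the leading pair of PLS-SVD weight vectors.

    These are the first left and right singular vectors of the
    cross-covariance matrix between column-centered X and Y.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return U[:, 0], Vt[0]


def bootstrap_ratios(X, Y, n_boot=200, seed=0):
    """Per-feature bootstrap ratios (mean / std) for both domains.

    Assumption: subjects are resampled with replacement and each
    bootstrap solution is sign-aligned to the full-sample solution.
    This is a generic bootstrap PLS, not the published method.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    u_ref, _ = first_pls_weights(X, Y)
    us = np.empty((n_boot, X.shape[1]))
    vs = np.empty((n_boot, Y.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample subjects
        u, v = first_pls_weights(X[idx], Y[idx])
        # Singular vectors are defined only up to a joint sign flip,
        # so align every bootstrap solution with the reference.
        if u @ u_ref < 0:
            u, v = -u, -v
        us[b], vs[b] = u, v
    z_x = us.mean(axis=0) / us.std(axis=0, ddof=1)
    z_y = vs.mean(axis=0) / vs.std(axis=0, ddof=1)
    return z_x, z_y
```

With `X` standing in for a subjects-by-variants matrix and `Y` for a subjects-by-imaging-features matrix, `bootstrap_ratios(X, Y)` returns one z-like score per input feature in each domain. At the scale mentioned in the abstract, forming the full cross-covariance matrix this way would likely be impractical, so the sketch is suitable for small illustrative data only.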