Genome-wide association studies (GWAS) identify genetic variables associated with disease susceptibility by comparing the genetic data of individuals with and without a specific disease. However, discovering these associations is challenging because of genetic heterogeneity and feature interactions. Genetic variables subject to these effects often exhibit smaller effect sizes and are therefore difficult to detect with machine learning feature selection methods. To address these challenges, this paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCS-Net). FCS-Net is designed to extract heterogeneous subsets of genetic variables from a network constructed from multiple independent feature selection runs based on a genetic algorithm (GA), an evolutionary learning algorithm. We employ a non-linear machine learning algorithm to detect feature interactions. We introduce the Community Risk Score (CRS), a synthetic feature designed to quantify the collective disease association of each variable subset. Our experiments on synthetic data demonstrate the effectiveness of the GA-based feature selection method in identifying feature interactions. Furthermore, we apply our approach to a case-control colorectal cancer GWAS dataset. The resulting synthetic features are then used to explain the genetic heterogeneity in an additional case-only GWAS dataset.
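The idea of building a network from multiple independent feature selection runs can be illustrated with a minimal sketch. It is not the FCS-Net implementation: the abstract does not specify how edges are weighted or how subsets are extracted, so this sketch assumes that an edge weight is the number of runs in which two features were selected together, and it stands in a simple weight threshold plus connected components for the subset extraction step. The function names, the `min_weight` parameter, and the SNP identifiers are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_coselection_network(runs):
    """Count how often each pair of features is selected together across runs."""
    edge_weights = Counter()
    for selected in runs:
        # Sort so that each undirected pair is always stored in the same order.
        for a, b in combinations(sorted(set(selected)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


def feature_subsets(edge_weights, min_weight=2):
    """Group features joined by edges of weight >= min_weight.

    Assumption: connected components stand in for whatever subset
    extraction the actual method uses.
    """
    adjacency = {}
    for (a, b), weight in edge_weights.items():
        if weight >= min_weight:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, subsets = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        subsets.append(sorted(component))
    return subsets


# Hypothetical output of four independent feature selection runs.
runs = [
    ["snp1", "snp2", "snp7"],
    ["snp1", "snp2"],
    ["snp4", "snp5"],
    ["snp4", "snp5", "snp2"],
]
network = build_coselection_network(runs)
subsets = feature_subsets(network, min_weight=2)
print(subsets)
```

In this toy input, only the pairs that recur across runs survive the threshold, so features that were co-selected once by chance do not join a subset. A score such as the CRS would then be computed per subset; its definition is not given in the abstract, so it is left out here.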