Genetic heterogeneity analysis using genetic algorithm and network science

Through genome-wide association studies (GWAS), disease-susceptibility genetic variables can be identified by comparing the genetic data of individuals with and without a specific disease. However, discovering these associations is challenging because of genetic heterogeneity and feature interactions. Genetic variables entangled with these effects often exhibit lower effect sizes and are therefore difficult to detect with machine learning feature selection methods. To address these challenges, this paper introduces a novel feature selection mechanism for GWAS, named Feature Co-selection Network (FCS-Net). FCS-Net is designed to extract heterogeneous subsets of genetic variables from a network constructed from multiple independent feature selection runs based on a genetic algorithm (GA), an evolutionary learning algorithm. We employ a non-linear machine learning algorithm to detect feature interactions. We introduce the Community Risk Score (CRS), a synthetic feature designed to quantify the collective disease association of each variable subset. Our experiment on synthetic data demonstrates that the GA-based feature selection method is effective at identifying feature interactions. Furthermore, we apply our approach to a case-control colorectal cancer GWAS dataset. The resulting synthetic features are then used to explain the genetic heterogeneity in an additional case-only GWAS dataset.
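The abstract does not spell out how the co-selection network is built, so the following is only a minimal sketch of the general idea under stated assumptions: each independent feature selection run yields a set of selected variables, an edge weight counts how often two variables are selected in the same run, and subsets are read off as connected components after dropping weak edges. The function names, the `min_weight` threshold, the SNP labels, and the use of connected components in place of a proper community detection method are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def build_coselection_network(runs):
    """Count how often each pair of features is selected in the same run."""
    edge_weights = Counter()
    for selected in runs:
        # Sort so each unordered pair maps to one canonical (a, b) key.
        for a, b in combinations(sorted(set(selected)), 2):
            edge_weights[(a, b)] += 1
    return edge_weights


def threshold_components(edge_weights, min_weight):
    """Group features linked by edges with weight >= min_weight.

    Assumption: connected components stand in for community detection.
    """
    adjacency = {}
    for (a, b), weight in edge_weights.items():
        if weight >= min_weight:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical selections from four independent feature selection runs.
runs = [
    ["snp1", "snp2", "snp7"],
    ["snp1", "snp2"],
    ["snp5", "snp6", "snp2"],
    ["snp5", "snp6"],
]
weights = build_coselection_network(runs)
subsets = threshold_components(weights, min_weight=2)
print(subsets)
```

In this toy input, only pairs that are co-selected in at least two runs stay connected, so variables that appear together by chance in a single run fall out of the resulting subsets.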

