Deep Learning for Efficient GWAS Feature Selection

Genome-wide association studies (GWAS) pose particular challenges in the era of large-scale genomic data, especially for ultra-high-dimensional datasets in which the number of genetic features far exceeds the number of available samples. This paper extends the feature selection method of Mirzaei et al. (2020) to ultra-high-dimensional GWAS data. The extension adds a Frobenius norm penalty to the student network, which improves its ability to cope with many features and few samples. The method works in both supervised and unsupervised settings and uses two neural networks. The first, an autoencoder or supervised autoencoder, performs dimension reduction and extracts salient features from the ultra-high-dimensional genomic data. The second, a regularized feed-forward network with a single hidden layer, performs the feature selection. Experimental results indicate that the approach is effective for feature selection on GWAS data: it handles the ultra-high-dimensional setting and adapts to the structure of genomic data across a range of experiments.
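As a rough illustration of the two-network idea, the sketch below trains a single-hidden-layer student to reproduce low-dimensional codes from a teacher, with a squared Frobenius norm penalty on the student's input weights. It is not the authors' implementation, and several pieces are assumptions not stated in the abstract: PCA stands in for the autoencoder teacher, the penalty is applied to the first-layer weights only, features are ranked by the row norms of those weights, and all function names and hyperparameters (`teacher_codes`, `train_student`, `rank_features`, `lam`, `hidden`) are hypothetical.

```python
import numpy as np


def frobenius_penalty(W):
    """Squared Frobenius norm: the sum of squared entries of W."""
    return float(np.sum(W ** 2))


def teacher_codes(X, k):
    """Stand-in teacher (assumption): top-k principal component scores.

    PCA is used here as a simple linear substitute for the autoencoder.
    """
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]


def train_student(X, Z, hidden=8, lam=1e-2, lr=5e-3, epochs=300, seed=0):
    """Fit a one-hidden-layer student to map X to the teacher codes Z.

    Assumed loss: (1 / 2n) * ||tanh(X W1) W2 - Z||^2 + lam * ||W1||_F^2,
    minimized with plain full-batch gradient descent.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.normal(scale=0.1, size=(p, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, Z.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W1)                   # hidden activations, shape (n, hidden)
        err = H @ W2 - Z                      # residual against teacher codes
        grad_W2 = H.T @ err / n
        grad_H = (err @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
        grad_W1 = X.T @ grad_H / n + 2.0 * lam * W1  # data term + Frobenius term
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2


def rank_features(W1):
    """Order features by the L2 norm of their row in W1, largest first."""
    return np.argsort(-np.linalg.norm(W1, axis=1))


# Synthetic data with more features than samples (p > n).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
Z = teacher_codes(X, k=3)
W1, W2 = train_student(X, Z)
print("penalty on W1:", frobenius_penalty(W1))
print("top 5 feature indices:", rank_features(W1)[:5])
```

Each input feature owns one row of `W1`, so a row with a small norm contributes little to the student's output; that is the motivation for ranking by row norm in this sketch. For a supervised variant one would presumably swap the teacher for a model that also uses labels, but the abstract does not give those details.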


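Feature selection

Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of choosing N of the M available features so that a given performance criterion is optimized. By keeping only the most informative of the original features, it reduces the dimensionality of the dataset. It is an important way to improve the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.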
