engGNN: A Dual-Graph Neural Network for Omics-Based Disease Classification and Feature Selection

Omics data, such as transcriptomics, proteomics, and metabolomics, provide critical insights into disease mechanisms and clinical outcomes. However, their high dimensionality, small sample sizes, and intricate underlying biological networks pose major challenges for reliable prediction and meaningful interpretation. Graph Neural Networks (GNNs) offer a promising way to integrate prior knowledge by encoding feature relationships as graphs. Yet existing methods typically rely on either an externally curated feature graph or a data-driven generated one, but not both, which limits their ability to capture complementary information. To address this, we propose the external and generated Graph Neural Network (engGNN), a dual-graph framework that jointly leverages known external biological networks and data-driven generated graphs. Specifically, engGNN constructs a biologically informed undirected feature graph from established network databases and complements it with a directed feature graph derived from tree-ensemble models. This dual-graph design produces more comprehensive embeddings, thereby improving predictive performance and interpretability. In extensive simulations and real-world applications to gene expression data, engGNN consistently outperforms state-of-the-art baselines. Beyond classification, engGNN provides interpretable feature importance scores that facilitate biologically meaningful discoveries, for example through pathway enrichment analysis. Taken together, these results highlight engGNN as a robust, flexible, and interpretable framework for disease classification and biomarker discovery in high-dimensional omics contexts.
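To make the dual-graph idea concrete, the following minimal sketch builds the two kinds of feature graph on toy data. It is an illustration under stated assumptions, not the engGNN implementation: the abstract does not specify how directed edges are derived from the tree ensemble, so the rule used here (an edge from a split feature to the feature split on immediately below it) is assumed, and the function names, the nested-dict tree format, and the feature labels `g1` to `g4` are all hypothetical.

```python
from collections import defaultdict


def external_graph(edge_list):
    """Undirected feature graph: store each known interaction in both directions."""
    adj = defaultdict(set)
    for u, v in edge_list:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    return adj


def generated_graph(trees):
    """Directed feature graph from fitted trees.

    Assumed rule (not taken from the paper): add parent -> child whenever a
    split on one feature is followed directly by a split on another feature.
    Each tree is a nested dict {"feature": name, "left": ..., "right": ...};
    a missing or None child stands for a leaf.
    """
    adj = defaultdict(set)

    def walk(node):
        for child in (node.get("left"), node.get("right")):
            if child is None:
                continue
            if child["feature"] != node["feature"]:
                adj[node["feature"]].add(child["feature"])
            walk(child)

    for tree in trees:
        walk(tree)
    return adj


# Toy inputs: hypothetical feature names, not real gene interactions.
known_edges = [("g1", "g2"), ("g2", "g4")]
toy_trees = [
    {"feature": "g1",
     "left": {"feature": "g2"},
     "right": {"feature": "g3", "left": {"feature": "g2"}}},
    {"feature": "g3", "left": {"feature": "g1"}},
]

ext = external_graph(known_edges)   # symmetric neighbourhoods
gen = generated_graph(toy_trees)    # direction follows split order

print({k: sorted(v) for k, v in ext.items()})
print({k: sorted(v) for k, v in gen.items()})
```

In this toy case the external graph links `g2` to both `g1` and `g4`, while the generated graph adds directed edges such as `g1 -> g3` that the external graph lacks, which is the kind of complementary information a dual-graph model could draw on.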
