Integrating multi-omic data is pivotal for understanding complex diseases, but the high dimensionality and noise of such data present significant challenges. Graph Neural Networks (GNNs) offer a robust framework for analyzing large-scale signaling pathways and protein-protein interaction networks, yet their expressivity is limited when capturing intricate biological relationships. To address this, we propose the Graph Sequence Language Model (GraphSeqLM), a framework that enhances GNNs with biological sequence embeddings generated by Large Language Models (LLMs). These embeddings encode structural and biological properties of DNA, RNA, and proteins, augmenting GNNs with enriched features for analyzing sample-specific multi-omic data. By integrating topological, sequence-derived, and biological information, GraphSeqLM achieves higher predictive accuracy than existing methods, paving the way for more effective multi-omic data integration in precision medicine.
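The core idea of the abstract, augmenting per-node omic measurements with LLM-derived sequence embeddings before graph message passing, can be illustrated with a minimal sketch. This is not the GraphSeqLM implementation: the concatenation-based fusion, the single GCN-style layer, the mean-pooling readout, the toy graph, and all names and dimensions (`fuse_features`, `gcn_layer`, `n_omic`, `d_seq`, and so on) are assumptions made for illustration, and the sequence embeddings are random placeholders standing in for the output of a pretrained biological language model.

```python
import numpy as np


def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric normalization D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def fuse_features(omic, seq_emb):
    """Concatenate omic values and sequence embeddings per node (assumed fusion)."""
    return np.concatenate([omic, seq_emb], axis=1)


def gcn_layer(a_norm, x, weight):
    """One GCN-style propagation step followed by ReLU."""
    return np.maximum(a_norm @ x @ weight, 0.0)


rng = np.random.default_rng(0)
n_nodes, n_omic, d_seq, d_hidden = 4, 3, 8, 5  # hypothetical sizes

# Toy interaction graph: a path over four genes/proteins (illustrative only).
adj = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)

# Sample-specific omic measurements for each node (random stand-in data).
omic = rng.normal(size=(n_nodes, n_omic))

# Placeholder for LLM sequence embeddings; a real pipeline would compute
# these once per gene/protein with a pretrained sequence model.
seq_emb = rng.normal(size=(n_nodes, d_seq))

x = fuse_features(omic, seq_emb)
weight = rng.normal(size=(x.shape[1], d_hidden))  # untrained weights
h = gcn_layer(normalize_adjacency(adj), x, weight)

# Mean-pool node states into one sample-level vector for a downstream classifier.
graph_repr = h.mean(axis=0)
```

In this sketch the sequence embeddings are shared across samples while the omic values vary per sample, which is one plausible reading of "sample-specific" analysis; the actual fusion strategy may differ.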