Learning effective protein representations is critical for a variety of tasks in biology, such as predicting protein function. Recent sequence representation learning methods based on Protein Language Models (PLMs) excel in sequence-based tasks, but their direct adaptation to tasks involving protein structures remains a challenge. In contrast, structure-based methods, which leverage 3D structural information through graph neural networks and geometric pre-training, show potential in function prediction tasks but still suffer from the limited number of available structures. To bridge this gap, our study undertakes a comprehensive exploration of joint protein representation learning by integrating a state-of-the-art PLM (ESM-2) with distinct structure encoders (GVP, GearNet, CDConv). We introduce three representation fusion strategies and explore different pre-training techniques. Our method achieves significant improvements over existing sequence- and structure-based methods, setting a new state of the art for function annotation. This study underscores several important design choices for fusing protein sequence and structure information. Our implementation is available at https://github.com/DeepGraphLearning/ESM-GearNet.