Protein representation learning methods have shown great potential to yield useful representations for many downstream tasks, especially protein classification. Moreover, a few recent studies have shown that self-supervised learning is a promising way to address the scarcity of labeled protein data. However, existing protein language models are usually pretrained on protein sequences alone and do not take the important structural information of proteins into account. To this end, we propose a novel structure-aware protein self-supervised learning method that effectively captures the structural information of proteins. In particular, a well-designed graph neural network (GNN) model is pretrained to preserve protein structural information through self-supervised tasks defined from two perspectives: pairwise residue distances and dihedral angles. Furthermore, we propose to leverage an available protein language model pretrained on protein sequences to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. Experiments on several supervised downstream tasks verify the effectiveness of our proposed method. The code of the proposed method is available at \url{https://github.com/GGchen1997/STEPS_Bioinformatics}.
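To make the two structural perspectives concrete, the following minimal sketch computes the two geometric quantities named above, pairwise residue distances and a dihedral angle, from 3D coordinates. It is an illustration only and not the authors' implementation: the function names `pairwise_distances` and `dihedral_angle` are hypothetical, each residue is assumed to be represented by a single point (for example its alpha carbon), and how these quantities enter the actual self-supervised tasks is not specified here.

```python
import math

def pairwise_distances(coords):
    """Return the symmetric matrix of Euclidean distances between residues.

    coords: list of (x, y, z) tuples, one point per residue (assumed here
    to be, e.g., the alpha-carbon position).
    """
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_angle(p0, p1, p2, p3):
    """Return the dihedral angle in degrees defined by four consecutive points.

    The angle is measured around the p1 -> p2 axis, between the plane
    (p0, p1, p2) and the plane (p1, p2, p3).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Normalize the central bond so the projections below are well scaled.
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))

# Usage with toy coordinates (made up for illustration, not real protein data).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dist_matrix = pairwise_distances(toy_coords)
trans_angle = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
cis_angle = dihedral_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
```

Both quantities are invariant to rotation and translation of the whole structure, which is one reason distances and angles are common choices of structural targets; the sign convention of the dihedral angle varies between tools, so only its magnitude should be compared across implementations.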