Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article, we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering, until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene ontology term prediction, and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance.
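To make the iterative procedure described in the abstract more concrete, the following is a minimal sketch of the cluster, score, select, and re-cluster loop. It is an illustration under stated assumptions, not the authors' implementation. The function names (`cluster_round`, `hierarchical_clustering`), the parameters (`radius`, `seq_window`, `keep_ratio`, `n_rounds`), the radius doubling between rounds, and the cluster-size score are all hypothetical. In particular, the framework is described as neural, so its cluster scores are presumably learned, whereas here cluster size serves as a stand-in.

```python
import numpy as np


def cluster_round(coords, radius=4.0, seq_window=1, keep_ratio=0.5):
    """One hypothetical round: build a cluster around every node, score the
    clusters, keep the best ones, and return the indices of their medoids."""
    n = len(coords)
    # Assumption: the row order of `coords` is the 1D (sequence) order.
    seq_pos = np.arange(n)
    dist3d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    dist1d = np.abs(seq_pos[:, None] - seq_pos[None, :])

    # Node j joins the cluster centred on node i if it is close in 3D space
    # or adjacent along the sequence (every node belongs to its own cluster).
    member = (dist3d <= radius) | (dist1d <= seq_window)

    # Placeholder score (assumption): the number of members in the cluster.
    scores = member.sum(axis=1).astype(float)

    # Keep the highest-scoring clusters; a stable sort breaks ties by index.
    n_keep = max(1, int(np.ceil(keep_ratio * n)))
    top = np.argsort(-scores, kind="stable")[:n_keep]

    medoids = []
    for c in top:
        idx = np.flatnonzero(member[c])
        # Medoid: the member with the smallest total 3D distance to the rest.
        sub = dist3d[np.ix_(idx, idx)]
        medoids.append(idx[np.argmin(sub.sum(axis=1))])
    # Several clusters may share a medoid, so duplicates are dropped.
    return np.unique(medoids)


def hierarchical_clustering(coords, n_rounds=3, radius=4.0):
    """Repeat the round on the surviving medoids, returning one index array
    per level (level 0 holds all residues)."""
    active = np.arange(len(coords))
    levels = [active]
    for _ in range(n_rounds):
        if len(active) <= 1:
            break
        local = cluster_round(coords[active], radius=radius)
        active = active[local]   # map back to original residue indices
        levels.append(active)
        radius *= 2.0            # assumption: widen the neighbourhood each round
    return levels


if __name__ == "__main__":
    # Toy "protein": eight residues spaced 1.5 units apart on a line.
    toy = np.array([[1.5 * i, 0.0, 0.0] for i in range(8)])
    for depth, nodes in enumerate(hierarchical_clustering(toy)):
        print(depth, nodes.tolist())
```

Each level keeps only the medoids of the selected clusters, so the node set shrinks from round to round and the list of levels forms a coarse-to-fine hierarchy. A learned scoring function could replace the size-based placeholder without changing the surrounding loop.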