Protein function prediction currently relies on encoding a protein's sequence or structure, but the leap from sequence to function and the scarcity of high-quality structural data create clear performance bottlenecks. Protein domains are functionally independent "building blocks" of proteins, and their combinations determine a protein's diverse biological functions. However, most existing studies have yet to thoroughly explore the intricate functional information that protein domains carry. To fill this gap, we propose a synergistic integration approach that builds a function-aware domain representation, together with a domain-joint contrastive learning strategy that distinguishes different protein functions while aligning the modalities. Specifically, we align domain semantics with Gene Ontology (GO) terms and text descriptions to pre-train domain embeddings. Furthermore, we partition each protein into multiple sub-views based on continuous joint domains and train contrastively under the supervision of a novel triplet InfoNCE loss. Our approach significantly outperforms state-of-the-art methods across various benchmarks and separates proteins with distinct functions more clearly than the competing method. Our implementation is available at https://github.com/AI-HPC-Research-Team/ProtFAD.