Recent years have seen a surge in 3D structure-based pre-trained protein models, which represent a significant advance over pre-trained protein language models on various downstream tasks. However, most existing structure-based pre-trained models operate primarily at the residue level, i.e., on alpha carbon atoms, and ignore other atoms such as those in side chains. We argue that modeling proteins at both the residue and atom levels is important, since side chain atoms can be crucial for numerous downstream tasks, for example, molecular docking. Nevertheless, we find that naively combining residue and atom information during pre-training typically fails. We identify a key reason: including atom structure in the input causes information leakage, which renders residue-level pre-training tasks trivial and results in insufficiently expressive residue representations. To address this issue, we introduce a span mask pre-training strategy on 3D protein chains to learn meaningful representations of both residues and atoms. This leads to a simple yet effective approach to learning protein representations suitable for diverse downstream tasks. Extensive experimental results on binding site prediction and function prediction tasks demonstrate that our proposed pre-training approach significantly outperforms other methods. Our code will be made public.
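As a rough illustration of the span masking idea, the following minimal sketch selects contiguous spans of residues along a chain and hides the side chain atoms of those residues, so that a masked residue's type cannot simply be read off its atoms. It is not the authors' implementation: the function names, the 15% mask ratio, the span length of 5, the backbone atom set, and the `(residue_index, atom_name, coords)` atom layout are all assumptions made for this example.

```python
import random

# Assumed backbone atom names (standard PDB naming); everything else is
# treated as a side chain atom in this sketch.
BACKBONE = {"N", "CA", "C", "O"}


def sample_span_mask(num_residues, mask_ratio=0.15, span_len=5, seed=None):
    """Return sorted residue indices covered by contiguous masked spans.

    mask_ratio and span_len are illustrative defaults, not values from the paper.
    """
    if num_residues <= 0:
        return []
    rng = random.Random(seed)
    # Number of residues to mask, at least one and at most the chain length.
    target = min(num_residues, max(1, int(num_residues * mask_ratio)))
    masked = set()
    while len(masked) < target:
        start = rng.randrange(num_residues)
        # Extend a span to the right, stopping at the chain end or the budget.
        for i in range(start, min(start + span_len, num_residues)):
            if len(masked) >= target:
                break
            masked.add(i)
    return sorted(masked)


def hide_side_chains(atoms, masked_residues):
    """Drop side chain atoms of masked residues; keep all other atoms.

    atoms: list of (residue_index, atom_name, (x, y, z)) tuples (assumed layout).
    """
    masked = set(masked_residues)
    return [a for a in atoms if a[0] not in masked or a[1] in BACKBONE]
```

Masking whole spans rather than isolated residues is meant to prevent a masked residue from being trivially inferred from its immediate neighbors; a pre-training objective would then predict the hidden residue types or atom positions from the remaining context.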