Protein language models have shown significant potential in protein engineering. However, current protein language models operate primarily at the residue scale, which limits their ability to provide atom-level information. This limitation prevents us from fully exploiting protein language models in applications that involve both proteins and small molecules. In this paper, we propose ESM-AA (ESM All-Atom), a novel approach that enables unified molecular modeling at both the atom scale and the residue scale. ESM-AA achieves this by pre-training on multi-scale code-switch protein sequences and by using a multi-scale position encoding to capture relationships among residues and atoms. Experimental results indicate that ESM-AA surpasses previous methods on protein-molecule tasks, showing that protein language models can be exploited more fully in this setting. Further investigation reveals that, through unified molecular modeling, ESM-AA not only gains molecular knowledge but also retains its understanding of proteins. The source code of ESM-AA is publicly available at https://github.com/zhengkangjie/ESM-AA.
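To make the idea of a multi-scale code-switch sequence more concrete, the following is a minimal sketch, not the released ESM-AA implementation. It assumes that "code-switching" means randomly expanding some residues of a protein sequence into their constituent atoms, and that each atom token inherits the sequence position of its parent residue as one simple form of residue-scale position information. The names `RESIDUE_ATOMS`, `code_switch`, and `unzip_ratio` are hypothetical and introduced only for illustration; the atom table covers just three residues.

```python
import random

# Hypothetical, truncated lookup of heavy-atom names per residue (illustration only).
RESIDUE_ATOMS = {
    "G": ["N", "CA", "C", "O"],
    "A": ["N", "CA", "C", "O", "CB"],
    "S": ["N", "CA", "C", "O", "CB", "OG"],
}


def code_switch(sequence, unzip_ratio=0.3, seed=0):
    """Build a mixed residue/atom token sequence.

    Assumption: each residue is expanded into its atoms with probability
    `unzip_ratio`; every atom keeps the index of its parent residue.
    Returns three parallel lists: tokens, residue positions, and scales.
    """
    rng = random.Random(seed)
    tokens, positions, scales = [], [], []
    for pos, residue in enumerate(sequence):
        if residue in RESIDUE_ATOMS and rng.random() < unzip_ratio:
            # Atom scale: one token per atom, all sharing the parent position.
            for atom in RESIDUE_ATOMS[residue]:
                tokens.append(atom)
                positions.append(pos)
                scales.append("atom")
        else:
            # Residue scale: the residue stays a single token.
            tokens.append(residue)
            positions.append(pos)
            scales.append("residue")
    return tokens, positions, scales


if __name__ == "__main__":
    toks, pos, scl = code_switch("GASG", unzip_ratio=0.5, seed=1)
    for t, p, s in zip(toks, pos, scl):
        print(p, s, t)
```

The `scales` list is kept alongside the tokens because an atom name such as `C` would otherwise be indistinguishable from the residue letter `C`. A real model would also need atom-scale geometry (for example, pairwise distances between atoms), which this sketch leaves out.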