This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, so that they acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, in which a lightweight structural adapter is implanted into a pLM to endow it with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves on state-of-the-art results by a large margin, with accuracy gains of 4% to 12% in sequence recovery (e.g., 55.65%/56.63% on the CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
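To make the two recurring terms in the abstract concrete, the following is a minimal sketch rather than the LM-Design implementation. It assumes that sequence recovery means the fraction of positions at which a designed sequence matches the native one, and it illustrates iterative refinement only as a generic loop that feeds the current prediction back into a predictor. The names `sequence_recovery`, `iterative_refine`, and the `predict` callable are hypothetical and do not come from the paper.

```python
from typing import Callable


def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Assumption: both sequences are aligned one-to-one and have equal length.
    """
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def iterative_refine(
    initial: str,
    predict: Callable[[str], str],
    num_steps: int,
) -> str:
    """Generic refinement loop: re-predict from the previous prediction.

    `predict` is a hypothetical stand-in for a structure-conditioned model
    that maps the current sequence to an updated one. The loop stops early
    once the sequence no longer changes.
    """
    seq = initial
    for _ in range(num_steps):
        new_seq = predict(seq)
        if new_seq == seq:
            break  # converged: another pass would change nothing
        seq = new_seq
    return seq
```

In this sketch, a recovery of 0.5565 would correspond to the 55.65% figure quoted above; how the reported numbers are aggregated across proteins in a benchmark is not stated in the abstract and is left open here.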