Protein language models (pLMs), pre-trained via causal language modeling on protein sequences, have emerged as a promising tool for protein sequence design. In real-world protein engineering, it is common to optimize amino acids in the middle of a protein sequence while keeping the remaining residues fixed. Unfortunately, because existing pLMs generate strictly left to right, they can only modify suffix residues conditioned on prefix residues, which is insufficient for an infilling task that must take the whole surrounding context into account. To identify pLMs that are more effective for protein engineering, we design a new benchmark, Secondary structurE InFilling rEcoveRy (SEIFER), which approximates infilling sequence design scenarios. By evaluating existing models on this benchmark, we reveal the weaknesses of existing language models and show that language models trained via the fill-in-middle transformation, called ProtFIM, are more appropriate for protein engineering. We also show, through extensive experiments and visualizations, that ProtFIM generates protein sequences while retaining decent protein representations.
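To make the fill-in-middle idea concrete, the following is a minimal sketch of how a sequence might be rearranged so that a left-to-right model sees both the prefix and the suffix before generating the middle. The sentinel tokens `<PRE>`, `<SUF>`, `<MID>`, the function names, and the identity-based recovery score are illustrative assumptions; the abstract does not specify ProtFIM's actual tokens or the exact SEIFER metric.

```python
# Hypothetical sentinel tokens (assumed for illustration; not taken from the source).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def split_span(sequence: str, start: int, end: int) -> tuple[str, str, str]:
    """Split a sequence into (prefix, middle, suffix) around the span [start, end)."""
    if not 0 <= start < end <= len(sequence):
        raise ValueError("span must satisfy 0 <= start < end <= len(sequence)")
    return sequence[:start], sequence[start:end], sequence[end:]


def fim_transform(sequence: str, start: int, end: int) -> str:
    """Training-time view: move the middle span to the end of the string,
    so a causal model learns to produce it after seeing prefix and suffix."""
    prefix, middle, suffix = split_span(sequence, start, end)
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def infill_prompt(sequence: str, start: int, end: int) -> str:
    """Inference-time view: same layout, but the middle is left for the model to fill."""
    prefix, _, suffix = split_span(sequence, start, end)
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


def recovery_rate(native_middle: str, generated_middle: str) -> float:
    """Fraction of positions where the generated residue matches the native one.
    Assumes equal-length spans; this is a simplified stand-in for a recovery metric."""
    if not native_middle or len(native_middle) != len(generated_middle):
        raise ValueError("spans must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native_middle, generated_middle))
    return matches / len(native_middle)


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # toy sequence, not a real protein
    print(fim_transform(seq, 3, 7))
    print(infill_prompt(seq, 3, 7))
    print(recovery_rate("AYIA", "AYLA"))
```

In this layout the model's ordinary next-token objective is unchanged; only the ordering of the input differs, which is why a causal pLM can be trained for infilling without architectural changes.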