Multimodal protein language models deliver strong performance on mutation-effect prediction, but training such models from scratch demands substantial computational resources. In this paper, we propose a fine-tuning framework called InstructPLM-mu and ask the question: \textit{Can multimodal fine-tuning of a pretrained, sequence-only protein language model match the performance of models trained end-to-end?} Surprisingly, our experiments show that fine-tuning ESM2 with structural inputs can reach performance comparable to ESM3. To understand how this is achieved, we systematically compare three feature-fusion designs and fine-tuning recipes. Our results reveal that both the fusion method and the tuning strategy strongly affect final accuracy, indicating that multimodal fine-tuning is far from trivial. We hope this work offers practical guidance for injecting structural information into pretrained protein language models and motivates further research on better fusion mechanisms and fine-tuning protocols.