Predicting which proteins interact from their amino-acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called DiffPALM that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids. We show that it captures inter-chain coevolution even though it was trained on single-chain data, which means that it can be used out of distribution. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer, without significantly deteriorating any of those we tested. It also achieves performance competitive with orthology-based pairing.