Protein language models (LMs) have been successful in sequence, structure, and function prediction. However, current protein LMs are limited to encoder-only or decoder-only architectures that model single sequences, whereas many biological contexts involve protein-protein interactions. Here, we introduce pAbT5, which models antibody chain pairing as forward- and back-translation using a T5-based architecture. We show that pAbT5 accurately reflects chain pairing through sequence generation. Our protein LM generates variable-length sequences, and its next-word prediction probabilities agree with the position-specific scoring matrix derived from sequence alignment. Like other protein LMs, pAbT5 achieves state-of-the-art unsupervised prediction of experimental measurements. To the best of our knowledge, pAbT5 is the first generative encoder-decoder protein LM for protein-protein interactions.
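As a rough illustration of what comparing next-word prediction probabilities against an alignment-derived profile could look like, the sketch below builds per-position amino-acid frequencies from a toy alignment and correlates them with a per-position probability table. It is a minimal sketch under stated assumptions, not the procedure used for pAbT5: the function names (`position_frequencies`, `profile_agreement`), the pseudocount smoothing, the use of raw frequencies rather than log-odds scores, and the choice of Pearson correlation are all hypothetical, and the model probabilities are assumed to be supplied as one dictionary per aligned position.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(alignment, pseudocount=1.0):
    """Per-column amino-acid frequencies from equal-length aligned sequences.

    Gap characters and any other non-standard symbols are ignored.
    The pseudocount (assumed > 0) keeps every frequency nonzero.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def profile_agreement(model_probs, profile):
    """Correlate model probabilities with alignment frequencies.

    Both arguments are lists with one {amino_acid: probability} dict per
    aligned position; values are flattened in a fixed amino-acid order.
    """
    if len(model_probs) != len(profile):
        raise ValueError("model_probs and profile must cover the same positions")
    xs = [pos[aa] for pos in model_probs for aa in AMINO_ACIDS]
    ys = [pos[aa] for pos in profile for aa in AMINO_ACIDS]
    return pearson(xs, ys)


# Toy alignment; in practice the probabilities would come from the decoder.
toy_alignment = ["ACD", "ACE", "ACD"]
toy_profile = position_frequencies(toy_alignment)
```

In practice, `model_probs` would be filled from the decoder's softmax output at each generated position, restricted to the 20 standard amino acids and renormalized; that extraction step depends on the tokenizer and is left out here.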