Protein language models (LMs) have been successful in sequence, structure, and function prediction. However, current protein LMs are limited to encoder-only or decoder-only architectures that operate on single sequences, whereas many biological contexts involve protein-protein interactions. Here, we introduce pAbT5, which models antibody chain pairing as forward- and back-translation using a T5-based architecture. We show that pAbT5 accurately reflects chain pairing through sequence generation. Our protein LM generates variable-length sequences, and its next-word prediction probabilities agree with the position-specific scoring matrix derived from sequence alignment. As in other protein LM work, pAbT5 achieves state-of-the-art unsupervised prediction of experimental measurements. To the best of our knowledge, pAbT5 is the first generative encoder-decoder protein LM for protein-protein interactions.
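As an illustration of what agreement between next-word prediction probabilities and an alignment-derived profile could look like, the following is a minimal sketch and not the authors' evaluation code. It assumes a simplified frequency profile with pseudocounts as a stand-in for a full log-odds position-specific scoring matrix, and it assumes the model's per-position amino-acid probabilities are already available as dictionaries. The names `position_frequencies`, `column_agreement`, and `model_probs` are hypothetical; no pAbT5 API is used.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(aligned_seqs, pseudocount=1.0):
    """Per-column amino-acid frequencies from equal-length aligned sequences.

    Gap characters ('-') and any non-standard symbols are ignored.
    This is a simplified frequency profile, not a log-odds PSSM.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must share one length")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def column_agreement(profile, model_probs):
    """Per-position correlation between alignment frequencies and model probabilities.

    `model_probs` is a hypothetical placeholder: one dict per position mapping
    each amino acid to the probability a language model assigns to it.
    """
    return [
        pearson([p[aa] for aa in AMINO_ACIDS], [q[aa] for aa in AMINO_ACIDS])
        for p, q in zip(profile, model_probs)
    ]


# Toy alignment of three length-3 sequences (illustrative data only).
toy_profile = position_frequencies(["ACD", "ACE", "ACD"])
```

In this sketch, a per-position correlation close to 1 would indicate that the model ranks amino acids at that position similarly to the alignment; the choice of Pearson correlation is an assumption made for illustration.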