Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labelled TCR data remains sparse. In other domains, pre-training language models on unlabelled data has been used successfully to address such data bottlenecks. However, it remains unclear how best to pre-train protein language models for TCR specificity prediction. Here we introduce SCEPTR (Simple Contrastive Embedding of the Primary sequence of T cell Receptors), a TCR language model capable of data-efficient transfer learning. With SCEPTR, we introduce a novel pre-training strategy combining autocontrastive learning and masked-language modelling, which enables SCEPTR to achieve state-of-the-art performance. In contrast, existing protein language models, as well as a variant of SCEPTR pre-trained without autocontrastive learning, are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm for decoding the rules of TCR specificity.
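To make the combined objective concrete, below is a minimal PyTorch sketch of a joint autocontrastive plus masked-language-modelling (MLM) loss of the kind the abstract describes. It assumes a SimCSE-style autocontrastive setup in which two dropout-noised forward passes over the same batch yield positive pairs; the function names, the `temperature` value, and the `mlm_weight` mixing coefficient are illustrative assumptions, not SCEPTR's actual implementation.

```python
# Hypothetical sketch of a joint autocontrastive + MLM pre-training loss.
# All names and hyperparameter values are illustrative, not SCEPTR's API.
import torch
import torch.nn.functional as F


def autocontrastive_loss(z1: torch.Tensor, z2: torch.Tensor,
                         temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE over two dropout-noised views of the same batch of TCRs.

    z1, z2: (batch, dim) embeddings of the *same* sequences under
    independent dropout masks; matching rows are positive pairs.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature            # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)     # diagonal entries are the positives


def mlm_loss(token_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard MLM cross-entropy; labels are -100 at unmasked positions."""
    return F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )


def pretraining_loss(z1, z2, token_logits, mlm_labels,
                     mlm_weight: float = 1.0) -> torch.Tensor:
    """Combined objective: sum of the contrastive and MLM terms."""
    return autocontrastive_loss(z1, z2) + mlm_weight * mlm_loss(token_logits, mlm_labels)
```

Summing the two terms lets a single encoder learn both sequence-level geometry, which supports distance-based transfer learning from sparse labelled data, and residue-level statistics captured by masked-token reconstruction.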