Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and they constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best-performing antibody-specific language models developed to date that can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively on the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large-scale datasets and high-performance computing to enhance antibody design for therapeutic development.