Antibodies are vital proteins that offer the human body robust protection against pathogens. The development of both general protein and antibody-specific pre-trained language models has facilitated antibody prediction tasks. However, few studies have comprehensively explored the representation capability of distinct pre-trained language models on different antibody problems. To investigate this, we aim to answer the following key questions: (1) How do pre-trained language models perform on antibody tasks with different levels of specificity? (2) How much does a model benefit when a specific biological mechanism is introduced into the pre-training process? (3) Are the learned antibody pre-trained representations useful in real-world antibody problems, such as drug discovery and understanding immune processes? Until now, the lack of an available benchmark has largely hindered efforts to answer these questions. To facilitate the investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pre-trained language models through an empirical study, and report conclusions and new insights. Our ATUE benchmark and code are released at https://github.com/dqwang122/EATLM.