We investigate the potential of patent data for improving antibody humanness prediction using a multi-stage, multi-loss training process. Humanness serves as a proxy for the immunogenic response to antibody therapeutics, which is one of the major causes of attrition in drug discovery and a challenging obstacle to their use in clinical settings. We pose the initial learning stage as a weakly supervised contrastive learning problem, in which each antibody sequence is associated with possibly multiple identifiers of function and the objective is to learn an encoder that groups sequences according to their patented properties. We then freeze part of the contrastive encoder and continue training it on the patent data with a cross-entropy loss to predict the humanness score of a given antibody sequence. We illustrate the utility of the patent data and of our approach by performing inference on three immunogenicity datasets that were unseen during training. Our empirical results demonstrate that the learned model consistently outperforms alternative baselines and establishes a new state of the art on five out of six inference tasks, irrespective of the metric used.
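The two losses named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes a SupCon-style contrastive loss in which two sequences count as a positive pair when they share at least one function identifier, and it assumes the humanness target is treated as a class index for the cross-entropy stage. The function names, the toy embeddings, and the identifiers `"A"`, `"B"`, `"C"` are hypothetical, and the partial freezing of the encoder is not shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def weak_contrastive_loss(embeddings, label_sets, temperature=0.1):
    """SupCon-style loss with weak, multi-identifier supervision.

    Assumption: sequences i and j are positives if their identifier sets overlap.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and label_sets[i] & label_sets[j]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other sequence.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor sequences.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def cross_entropy(logits, target):
    """Cross-entropy of a single example given raw logits and a class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Hypothetical identifiers of function attached to four toy sequences.
label_sets = [{"A"}, {"A", "B"}, {"C"}, {"C"}]
# Embeddings that group sequences sharing an identifier ...
grouped = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# ... versus embeddings that place them next to unrelated sequences.
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_grouped = weak_contrastive_loss(grouped, label_sets)
loss_mixed = weak_contrastive_loss(mixed, label_sets)
```

On this toy data the contrastive loss is lower for `grouped` than for `mixed`, which is the behavior the first training stage is meant to reward. In a real pipeline the second stage would apply `cross_entropy` to the output of a prediction head placed on top of the partly frozen encoder.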