We investigate the potential of patent data for improving antibody humanness prediction using a multi-stage, multi-loss training process. Humanness serves as a proxy for the immunogenic response to antibody therapeutics, which is one of the major causes of attrition in drug discovery and a challenging obstacle to their use in clinical settings. We pose the initial learning stage as a weakly supervised contrastive-learning problem, where each antibody sequence is associated with possibly multiple identifiers of function and the objective is to learn an encoder that groups sequences according to their patented properties. We then freeze a part of the contrastive encoder and continue training it on the patent data using the cross-entropy loss to predict the humanness score of a given antibody sequence. We illustrate the utility of the patent data and our approach by performing inference on three different immunogenicity datasets, unseen during training. Our empirical results demonstrate that the learned model consistently outperforms the alternative baselines and establishes a new state of the art on five out of six inference tasks, irrespective of the metric used.
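To make the first training stage concrete, the following is a minimal sketch of a multi-label contrastive loss in plain Python. It is not the formulation used in this work: the rule that two sequences count as a positive pair when they share at least one function identifier, the supervised-contrastive (SupCon-style) form of the loss, the temperature value, and the name `multilabel_contrastive_loss` are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def multilabel_contrastive_loss(embeddings, label_sets, temperature=0.1):
    """SupCon-style loss over one batch of sequence embeddings.

    Assumption: samples i and j form a positive pair when their
    identifier sets overlap. Anchors without any positive are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [
            j for j in range(n) if j != i and label_sets[i] & label_sets[j]
        ]
        if not positives:
            continue
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Toy batch: four 2-d embeddings with hypothetical function identifiers.
toy_labels = [{"a"}, {"a", "b"}, {"c"}, {"c"}]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(multilabel_contrastive_loss(toy_embeddings, toy_labels))
```

The loss is small when sequences that share an identifier lie close together in embedding space and grows when they are pushed apart, which is the grouping behaviour the first stage aims for. In a real pipeline the embeddings would come from the encoder, and the second stage would reuse that encoder, partly frozen, under a cross-entropy objective.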