We investigate the potential of patent data for improving antibody humanness prediction using a multi-stage, multi-loss training process. Humanness serves as a proxy for the immunogenic response to antibody therapeutics, which is one of the major causes of attrition in drug discovery and a challenging obstacle to their use in clinical settings. We pose the initial learning stage as a weakly supervised contrastive learning problem, in which each antibody sequence is associated with one or more identifiers of function and the objective is to learn an encoder that groups sequences according to their patented properties. We then freeze part of the contrastive encoder and continue training it on the patent data with a cross-entropy loss to predict the humanness score of a given antibody sequence. We illustrate the utility of the patent data and of our approach by performing inference on three different immunogenicity datasets that were unseen during training. Our empirical results demonstrate that the learned model consistently outperforms the alternative baselines and establishes a new state of the art on five out of six inference tasks, irrespective of the metric used.
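To make the first training stage concrete, the following is a minimal sketch of one way a weakly supervised, multi-label contrastive objective could be written. It is an illustration under stated assumptions and not the loss used in this work: the abstract does not specify the exact formulation, so the sketch assumes a supervised-contrastive-style loss in which two sequences count as a positive pair when they share at least one function identifier. The function names, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def multilabel_contrastive_loss(embeddings, label_sets, temperature=0.1):
    """Supervised-contrastive-style loss for multi-label data (assumed form).

    Two items are treated as positives if their label sets overlap.
    Anchors without any positive in the batch are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [
            j for j in range(n)
            if j != i and label_sets[i] & label_sets[j]
        ]
        if not positives:
            continue
        # Temperature-scaled similarities of the anchor to every other item.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor items.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(v - peak) for v in logits.values())
        )
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: hypothetical function identifiers "A", "B", "C".
labels = [{"A"}, {"A", "B"}, {"C"}, {"C"}]
grouped = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scattered = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_grouped = multilabel_contrastive_loss(grouped, labels)
loss_scattered = multilabel_contrastive_loss(scattered, labels)
print(loss_grouped < loss_scattered)
```

In this toy batch, embeddings that place sequences with shared identifiers close together receive a lower loss than embeddings that scatter them, which is the grouping behaviour the first stage aims for. The second stage described above (freezing part of the encoder and training with cross-entropy on humanness labels) would then start from an encoder trained this way.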