Many protein language models have been released in recent years, yet comparatively little work has addressed how best to sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. First, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Second, we report results from our extensive in vitro head-to-head evaluation in the antibody engineering setting. The evaluation reveals that the choice of sampling method is at least as impactful as the choice of model, motivating future research into this under-explored area.
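To make the idea of beam search over a 1-edit neighborhood concrete, the following is a minimal sketch, not the authors' implementation. It assumes substitution-only edits, uses Gumbel-top-k perturbation as one common way to make beam selection stochastic, and replaces the MLM pseudo-perplexity with a hypothetical `toy_score` function. All function names and parameters here are illustrative assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_edit_neighborhood(seq):
    """Return every sequence that differs from `seq` by one substitution."""
    neighbors = []
    for i, old in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != old:
                neighbors.append(seq[:i] + aa + seq[i + 1:])
    return neighbors


def toy_score(seq):
    """Hypothetical stand-in for a model score (higher is better).

    A real setup would use something like the negative MLM
    pseudo-perplexity, possibly combined with other objectives.
    """
    hydrophobic = set("AILMFVW")
    return sum(1.0 for aa in seq if aa in hydrophobic) / len(seq)


def stochastic_beam_search(start, score_fn, beam_width=4, steps=3,
                           temperature=0.1, rng=None):
    """Assumed variant: Gumbel-top-k selection over 1-edit neighbors."""
    rng = rng or random.Random(0)
    beam = [start]
    for _ in range(steps):
        # Pool the 1-edit neighbors of every sequence currently in the beam.
        candidates = set()
        for seq in beam:
            candidates.update(one_edit_neighborhood(seq))
        perturbed = []
        for cand in sorted(candidates):  # sorted for reproducibility
            u = max(rng.random(), 1e-12)  # guard against log(0)
            gumbel = -math.log(-math.log(u))
            # Adding Gumbel noise and keeping the top k samples k items
            # without replacement from softmax(score / temperature).
            perturbed.append((score_fn(cand) / temperature + gumbel, cand))
        perturbed.sort(reverse=True)
        beam = [cand for _, cand in perturbed[:beam_width]]
    return beam


if __name__ == "__main__":
    for seq in stochastic_beam_search("QVQLVES", toy_score):
        print(seq, round(toy_score(seq), 3))
```

Because each candidate is scored as a whole sequence, several objectives could be combined by passing a `score_fn` that returns, for example, a weighted sum of individual scores. The weighting scheme is a design choice left open in this sketch.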