Activity and property prediction models are the central workhorses of drug discovery and materials science, but currently they have to be trained or fine-tuned for each new task. Scientific language models could be applied to such low-data tasks without training or fine-tuning, through their announced zero- and few-shot capabilities. However, their predictive quality on activity prediction is lacking. In this work, we envision a novel type of activity prediction model that adapts to new prediction tasks at inference time by understanding textual information that describes the task. To this end, we propose a new architecture with separate modules for chemical and natural language inputs, and a contrastive pre-training objective on data from large biochemical databases. In extensive experiments, we show that our method, CLAMP, yields improved predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. We attribute the advances of our method to the modularized architecture and to our pre-training objective.
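To make the idea of separate modules for chemical and natural language inputs more concrete, the following is a minimal sketch of how two encoders and a contrastive objective might fit together. It is an illustration under stated assumptions, not the CLAMP implementation: the random linear encoders, the cosine-similarity scoring rule, the `temperature` parameter, the binary cross-entropy form of the loss, and all function names are hypothetical, since the abstract does not specify them.

```python
import math
import random


def linear_encoder(in_dim, out_dim, seed):
    """Return a stand-in encoder: a fixed random linear map (hypothetical)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features):
        # Project the input features into the shared embedding space.
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return encode


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def activity_probability(mol_emb, task_emb, temperature=0.1):
    """Assumed scoring rule: sigmoid of the temperature-scaled similarity."""
    return 1.0 / (1.0 + math.exp(-cosine_similarity(mol_emb, task_emb) / temperature))


def contrastive_loss(pairs, temperature=0.1):
    """Mean binary cross-entropy over (mol_emb, task_emb, label) triples.

    Active pairs (label 1) are pulled together, inactive pairs (label 0)
    are pushed apart. This loss form is an assumption for illustration.
    """
    eps = 1e-12
    total = 0.0
    for mol_emb, task_emb, label in pairs:
        p = activity_probability(mol_emb, task_emb, temperature)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(pairs)


# Toy inputs: a binary molecular fingerprint and a made-up text feature vector.
mol_encoder = linear_encoder(in_dim=8, out_dim=4, seed=0)
text_encoder = linear_encoder(in_dim=6, out_dim=4, seed=1)
molecule_features = [1, 0, 1, 1, 0, 0, 1, 0]
task_features = [0.2, 0.0, 0.5, 0.1, 0.0, 0.7]

# Both modalities land in the same 4-dimensional space and are compared there.
score = activity_probability(mol_encoder(molecule_features), text_encoder(task_features))
```

In a design of this kind, a new task needs only a textual description at inference time: the description is embedded once and every candidate molecule is scored against it, with no task-specific training.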