Generating molecules with desired properties has grown rapidly in popularity, changing how scientists design molecular structures and supporting chemical and materials design. However, despite promising results, previous deep generative models rely on complex, task-specific fine-tuning, are restricted to latent spaces of limited dimensionality, or depend on the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic and natural products) with a self-feedback mechanism. Extensive experiments show that MolGen achieves superior performance on well-known molecular generation benchmark datasets. Further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore chemical space under the guidance of multi-task molecular prefix tuning. Code, datasets, and the pre-trained model will be available at https://github.com/zjunlp/MolGen.