Molecule discovery plays a crucial role in various scientific fields, advancing the design of tailored materials and drugs. Traditional molecule discovery follows a trial-and-error process that is both time-consuming and costly, while computational approaches such as artificial intelligence (AI) have emerged as revolutionary tools to expedite various tasks, such as molecule-caption translation. Despite the importance of molecule-caption translation for molecule discovery, most existing methods rely heavily on domain experts, incur excessive computational cost, and suffer from poor performance. On the other hand, Large Language Models (LLMs) such as ChatGPT have shown remarkable performance in various cross-modal tasks thanks to their powerful capabilities in natural language understanding, generalization, and reasoning, which provide unprecedented opportunities to advance molecule discovery. To address these limitations, we propose a novel LLM-based framework (\textbf{MolReGPT}) for molecule-caption translation, in which a retrieval-based prompt paradigm is introduced to empower molecule discovery with LLMs like ChatGPT without fine-tuning. More specifically, MolReGPT leverages the principle of molecular similarity to retrieve similar molecules and their text descriptions from a local database, grounding the LLM's generation through in-context few-shot molecule learning. We evaluate MolReGPT on molecule-caption translation, which comprises molecule understanding and text-based molecule generation. Experimental results show that MolReGPT, without any additional training, outperforms fine-tuned models such as MolT5-base. To the best of our knowledge, MolReGPT is the first work to leverage LLMs in molecule-caption translation for advancing molecule discovery.
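To make the retrieval-based prompt paradigm concrete, the following is a minimal sketch of the molecule-to-caption direction: rank entries of a local database by similarity to a query molecule, then assemble the top matches into a few-shot prompt. The abstract does not specify the fingerprint, the similarity measure, or the prompt wording, so everything here is an assumption for illustration: the toy `DATABASE`, the character-bigram stand-in for a molecular fingerprint, the Tanimoto score, and the hypothetical helper names `retrieve` and `build_prompt`. No LLM is called; the sketch stops at the prompt string.

```python
from typing import List, Set, Tuple

# Hypothetical local database of (SMILES, caption) pairs; illustrative only.
DATABASE: List[Tuple[str, str]] = [
    ("CCO", "Ethanol is a primary alcohol."),
    ("CCCO", "Propan-1-ol is a primary alcohol."),
    ("c1ccccc1", "Benzene is an aromatic hydrocarbon."),
    ("CC(=O)O", "Acetic acid is a simple carboxylic acid."),
]


def bigrams(smiles: str) -> Set[str]:
    """Character bigrams of a SMILES string.

    Assumption: a crude stand-in for a real molecular fingerprint.
    """
    if len(smiles) < 2:
        return {smiles}
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, database: List[Tuple[str, str]], n: int = 2) -> List[Tuple[str, str]]:
    """Return the n database entries most similar to the query molecule."""
    q = bigrams(query)
    ranked = sorted(
        database,
        key=lambda pair: tanimoto(q, bigrams(pair[0])),
        reverse=True,
    )
    return ranked[:n]


def build_prompt(query: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a few-shot prompt from retrieved examples (wording is assumed)."""
    parts = ["You are given molecules as SMILES strings. Describe the last molecule."]
    for smiles, caption in examples:
        parts.append(f"Molecule: {smiles}\nCaption: {caption}")
    parts.append(f"Molecule: {query}\nCaption:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    query = "CCCCO"  # butan-1-ol
    print(build_prompt(query, retrieve(query, DATABASE, n=2)))
```

The resulting string would be sent to the LLM as its context, so the model sees structurally similar molecules with known captions before completing the final one. The text-to-molecule direction could follow the same pattern with a text retriever over the captions; a chemistry-aware fingerprint would presumably replace the bigram stand-in in any real system.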