Molecule discovery plays a crucial role in various scientific fields, advancing the design of tailored materials and drugs. However, most existing methods heavily rely on domain experts, incur excessive computational cost, or suffer from sub-optimal performance. On the other hand, Large Language Models (LLMs), such as ChatGPT, have shown remarkable performance in various cross-modal tasks thanks to their powerful capabilities in natural language understanding, generalization, and in-context learning (ICL), which offer unprecedented opportunities to advance molecule discovery. Although several prior works have attempted to apply LLMs to this task, the lack of domain-specific corpora and the difficulty of training specialized LLMs remain challenges. In this work, we propose a novel LLM-based framework (MolReGPT) for molecule-caption translation, in which an In-Context Few-Shot Molecule Learning paradigm is introduced so that LLMs like ChatGPT can exercise their in-context learning capability for molecule discovery without domain-specific pre-training or fine-tuning. MolReGPT leverages the principle of molecular similarity to retrieve similar molecules and their text descriptions from a local database, enabling LLMs to learn task knowledge from context examples. We evaluate the effectiveness of MolReGPT on molecule-caption translation, covering both molecule understanding and text-based molecule generation. Experimental results show that, without additional training, MolReGPT outperforms the fine-tuned MolT5-base and is comparable to MolT5-large. To the best of our knowledge, MolReGPT is the first work to leverage LLMs via in-context learning for molecule-caption translation to advance molecule discovery. Our work expands the scope of LLM applications and provides a new paradigm for molecule discovery and design.
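The retrieval step described above can be sketched as follows. This is a minimal, illustrative sketch, not the paper's implementation: real pipelines typically compute molecular similarity with Morgan fingerprints and Tanimoto scores (e.g. via RDKit), whereas here a toy character-n-gram fingerprint of the SMILES string stands in so the example stays dependency-free. All function names, the example database, and the prompt format are assumptions for illustration.

```python
# Sketch of similarity-based retrieval for in-context few-shot prompting:
# rank a local database of (SMILES, caption) pairs by similarity to a
# query molecule and assemble the top-k pairs into a few-shot prompt.
# NOTE: the n-gram "fingerprint" below is a toy stand-in for a real
# molecular fingerprint (e.g. RDKit Morgan fingerprints).

def ngram_fingerprint(smiles: str, n: int = 2) -> set:
    """Set of character n-grams of a SMILES string (toy fingerprint)."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def retrieve_examples(query_smiles: str, database: list, k: int = 2) -> list:
    """Return the k most similar (SMILES, caption) pairs from the database."""
    fp_q = ngram_fingerprint(query_smiles)
    ranked = sorted(
        database,
        key=lambda rec: tanimoto(fp_q, ngram_fingerprint(rec[0])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query_smiles: str, examples: list) -> str:
    """Assemble a few-shot prompt from the retrieved context examples."""
    shots = "\n".join(f"Molecule: {s}\nCaption: {c}" for s, c in examples)
    return f"{shots}\nMolecule: {query_smiles}\nCaption:"

# Tiny illustrative database of molecule-caption pairs.
database = [
    ("CCO", "Ethanol, a simple alcohol."),
    ("CCCO", "1-Propanol, a primary alcohol."),
    ("c1ccccc1", "Benzene, an aromatic hydrocarbon."),
]
examples = retrieve_examples("CCCCO", database, k=2)
prompt = build_prompt("CCCCO", examples)
```

The resulting prompt would then be sent to the LLM, which completes the final "Caption:" line; the reverse direction (text-based molecule generation) retrieves by caption similarity instead.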