Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

Molecule discovery plays a crucial role in various scientific fields, advancing the design of tailored materials and drugs. Traditional methods for molecule discovery follow a trial-and-error process, which are both time-consuming and costly, while computational approaches such as artificial intelligence (AI) have emerged as revolutionary tools to expedite various tasks, like molecule-caption translation. Despite the importance of molecule-caption translation for molecule discovery, most of the existing methods heavily rely on domain experts, require excessive computational cost, and suffer from poor performance. On the other hand, Large Language Models (LLMs), like ChatGPT, have shown remarkable performance in various cross-modal tasks due to their great powerful capabilities in natural language understanding, generalization, and reasoning, which provides unprecedented opportunities to advance molecule discovery. To address the above limitations, in this work, we propose a novel LLMs-based framework (\textbf{MolReGPT}) for molecule-caption translation, where a retrieval-based prompt paradigm is introduced to empower molecule discovery with LLMs like ChatGPT without fine-tuning. More specifically, MolReGPT leverages the principle of molecular similarity to retrieve similar molecules and their text descriptions from a local database to ground the generation of LLMs through in-context few-shot molecule learning. We evaluate the effectiveness of MolReGPT via molecule-caption translation, which includes molecule understanding and text-based molecule generation. Experimental results show that MolReGPT outperforms fine-tuned models like MolT5-base without any additional training. To the best of our knowledge, MolReGPT is the first work to leverage LLMs in molecule-caption translation for advancing molecule discovery.

翻译：分子发现在推动定制化材料与药物设计等多个科学领域具有关键作用。传统分子发现方法依赖试错流程，耗时且成本高昂，而人工智能等计算手段已作为革命性工具涌现，可加速分子-描述翻译等多种任务。尽管分子-描述翻译对分子发现至关重要，但现有方法大多严重依赖领域专家、计算成本过高且性能欠佳。另一方面，ChatGPT等大型语言模型凭借其在自然语言理解、泛化与推理方面的强大能力，已在多种跨模态任务中展现卓越性能，为推进分子发现提供了前所未有的机遇。为解决上述局限，本研究提出一种基于大型语言模型的新型框架（\textbf{MolReGPT}）用于分子-描述翻译，该框架引入基于检索的提示范式，无需微调即可通过ChatGPT等大型语言模型赋能分子发现。具体而言，MolReGPT利用分子相似性原理，从本地数据库中检索相似分子及其文本描述，通过上下文少样本分子学习为大型语言模型的生成提供基础。我们通过分子-描述翻译（包括分子理解与基于文本的分子生成）评估了MolReGPT的有效性。实验结果表明，MolReGPT无需额外训练即可超越MolT5-base等微调模型。据我们所知，MolReGPT是首个将大型语言模型用于分子-描述翻译以推动分子发现的研究工作。

相关内容

MoDELS

关注 46

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

14+阅读 · 2022年3月12日

O’Reilly报告：知识图谱崛起——面向现代数据集成和数据结构体系，“The Rise of the Knowledge Graph——Toward Modern Data Integration and the Data Fabric Architecture”

专知会员服务

49+阅读 · 2022年2月18日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日