Natural language is expected to become a key medium for human-machine interaction in the era of large language models. In biochemistry, tasks centered on molecules (e.g., property prediction and molecule mining) are of great significance but carry a high technical threshold. Bridging molecular expressions in natural language and chemical language can not only greatly improve the interpretability of these tasks and reduce their operational difficulty, but also fuse the chemical knowledge scattered across complementary materials for a deeper comprehension of molecules. Motivated by these benefits, we propose conversational molecular design, a novel task that adopts natural language for describing and editing target molecules. To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model enhanced by injecting experimental property information, molecular spatial knowledge, and the associations between natural and chemical languages. We evaluate several typical solutions, including large language models (e.g., ChatGPT), and the results demonstrate both the difficulty of conversational molecular design and the effectiveness of our knowledge enhancement method. Case observations and analysis are provided to suggest directions for further exploration of natural-language interaction in molecular discovery.