ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis

We use prompt engineering to guide ChatGPT in automating the text mining of metal-organic framework (MOF) synthesis conditions from scientific literature written in diverse formats and styles. This effectively mitigates ChatGPT's tendency to hallucinate information, an issue that previously made the use of large language models (LLMs) in scientific fields challenging. Our approach is a workflow that implements three different text-mining processes, each programmed by ChatGPT itself. All three enable parsing, searching, filtering, classification, summarization, and data unification, with different tradeoffs between labor, speed, and accuracy. We deploy this system to extract 26,257 distinct synthesis parameters pertaining to approximately 800 MOFs sourced from peer-reviewed research articles. The process incorporates our ChemPrompt Engineering strategy for instructing ChatGPT in text mining, and achieves precision, recall, and F1 scores of 90-99%. Furthermore, using the dataset built by text mining, we constructed a machine-learning model that predicts MOF experimental crystallization outcomes with over 86% accuracy and gives a preliminary identification of important factors in MOF crystallization. We also developed a reliable, data-grounded MOF chatbot to answer questions on chemical reactions and synthesis procedures. Because this use of ChatGPT reliably mines and tabulates diverse MOF synthesis information in a unified format, using only narrative language and requiring no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be very useful across various other chemistry sub-disciplines.
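
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how precision, recall, and F1 could be computed for a text-mining run by comparing extracted synthesis parameters against a manually curated reference set. The function name, the tuple layout (compound, parameter, value), and the sample data are illustrative assumptions, not the evaluation code or data used in this work.

```python
def precision_recall_f1(extracted, reference):
    """Score extracted items against a reference set using exact matching.

    Both arguments are iterables of hashable items; here each item is a
    hypothetical (compound, parameter, value) tuple.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: four hand-labeled parameters, three mined ones,
# one of which (the reaction time) was extracted incorrectly.
reference = {
    ("MOF-A", "metal source", "Zn(NO3)2"),
    ("MOF-A", "solvent", "DMF"),
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "time", "24 h"),
}
extracted = {
    ("MOF-A", "solvent", "DMF"),
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "time", "48 h"),
}

p, r, f1 = precision_recall_f1(extracted, reference)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Exact tuple matching is the simplest possible criterion; a real evaluation would likely need to normalize units and synonyms before comparing values.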
