Large Language Models (LLMs) create exciting possibilities for powerful language processing tools that accelerate research in materials science. Although LLMs have great potential to advance materials understanding and discovery, they currently fall short of being practical materials science tools. In this position paper, we present failure cases of LLMs in materials science that reveal their current limitations in comprehending and reasoning over complex, interconnected materials science knowledge. Given those shortcomings, we outline a framework for developing Materials Science LLMs (MatSci-LLMs) that are grounded in materials science knowledge and in hypothesis generation followed by hypothesis testing. The path to performant MatSci-LLMs rests in large part on building high-quality, multi-modal datasets sourced from scientific literature, where various information extraction challenges persist. We therefore describe the key materials science information extraction challenges that must be overcome to build large-scale, multi-modal datasets that capture valuable materials science knowledge. Finally, we outline a roadmap for applying future MatSci-LLMs to real-world materials discovery via: 1. Automated Knowledge Base Generation; 2. Automated In-Silico Material Design; and 3. MatSci-LLM Integrated Self-Driving Materials Laboratories.