Large Language Models (LLMs) create exciting possibilities for powerful language processing tools to accelerate research in materials science. While LLMs have great potential to advance materials understanding and discovery, they currently fall short of being practical materials science tools. In this position paper, we show relevant failure cases of LLMs in materials science that reveal current limitations in comprehending and reasoning over complex, interconnected materials science knowledge. Given those shortcomings, we outline a framework for developing Materials Science LLMs (MatSci-LLMs) that are grounded in materials science knowledge and in hypothesis generation followed by hypothesis testing. The path to performant MatSci-LLMs rests in large part on building high-quality, multi-modal datasets sourced from scientific literature, where various information extraction challenges persist. As such, we describe key materials science information extraction challenges that need to be overcome in order to build large-scale, multi-modal datasets that capture valuable materials science knowledge. Finally, we outline a roadmap for applying future MatSci-LLMs to real-world materials discovery via: 1. Automated Knowledge Base Generation; 2. Automated In-Silico Material Design; and 3. MatSci-LLM Integrated Self-Driving Materials Laboratories.