Materials discovery and development are critical for addressing global challenges. Yet the exponential growth of the materials science literature, comprising vast amounts of textual data, has created significant bottlenecks in knowledge extraction, synthesis, and scientific reasoning. Large Language Models (LLMs) offer unprecedented opportunities to accelerate materials research through automated analysis and prediction, but their effective deployment requires domain-specific adaptation to understand and solve domain-relevant tasks. Here, we present LLaMat, a family of foundational models for materials science developed through continued pretraining of LLaMA models on an extensive corpus of materials literature and crystallographic data. Through systematic evaluation, we demonstrate that LLaMat excels in materials-specific NLP and structured information extraction while retaining general linguistic capabilities. The specialized LLaMat-CIF variant shows unprecedented capability in crystal structure generation, predicting stable crystals with high coverage across the periodic table. Intriguingly, despite LLaMA-3's superior general-purpose performance relative to LLaMA-2, we observe that LLaMat-2 achieves unexpectedly stronger domain-specific performance across diverse materials science tasks, including structured information extraction from text and tables and, most notably, crystal structure generation, suggesting a potential adaptation rigidity in overtrained LLMs. Altogether, this work demonstrates the effectiveness of domain adaptation for developing practically deployable LLM copilots for materials research. Beyond materials science, our findings reveal important considerations for domain adaptation of LLMs, such as model selection, training methodology, and domain-specific performance, which may influence the development of specialized scientific AI systems.