Materials discovery and development are critical for addressing global challenges. Yet the exponential growth of the materials science literature, comprising vast amounts of textual data, has created significant bottlenecks in knowledge extraction, synthesis, and scientific reasoning. Large Language Models (LLMs) offer unprecedented opportunities to accelerate materials research through automated analysis and prediction, but their effective deployment requires domain-specific adaptation to understand and solve domain-relevant tasks. Here, we present LLaMat, a family of foundational models for materials science developed through continued pretraining of LLaMA models on an extensive corpus of materials literature and crystallographic data. Through systematic evaluation, we demonstrate that LLaMat excels in materials-specific NLP and structured information extraction while retaining general linguistic capabilities. The specialized LLaMat-CIF variant shows unprecedented capability in crystal structure generation, predicting stable crystals with high coverage across the periodic table. Intriguingly, despite LLaMA-3's superior general-purpose performance relative to LLaMA-2, we observe that LLaMat-2 achieves unexpectedly stronger domain-specific performance across diverse materials science tasks, including structured information extraction from text and tables and, most notably, crystal structure generation, suggesting a potential adaptation rigidity in overtrained LLMs. Altogether, this work demonstrates the effectiveness of domain adaptation for developing practically deployable LLM copilots for materials research. Beyond materials science, our findings reveal important considerations for domain adaptation of LLMs, such as model selection, training methodology, and domain-specific performance, which may influence the development of specialized scientific AI systems.