HoneyBee: Progressive Instruction Finetuning of Large Language Models for Materials Science

We propose an instruction-based process for trustworthy data curation in materials science (MatSci-Instruct), which we then apply to finetune a LLaMa-based language model targeted at materials science (HoneyBee). MatSci-Instruct helps alleviate the scarcity of relevant, high-quality materials science text in the open literature, and HoneyBee is the first billion-parameter language model specialized for materials science. In MatSci-Instruct, we improve the trustworthiness of generated data by using multiple commercially available large language models: an Instructor module (e.g., ChatGPT) generates the data and an independent Verifier module (e.g., Claude) verifies it. Using MatSci-Instruct, we construct a multi-task dataset and measure its quality along multiple dimensions, including accuracy against known facts, relevance to materials science, and the completeness and reasonableness of the data. Moreover, we iteratively generate more targeted instructions and instruction-data in a finetuning-evaluation-feedback loop, leading to progressively better performance for our finetuned HoneyBee models. Our evaluation on the MatSci-NLP benchmark shows that HoneyBee outperforms existing language models on materials science tasks and improves iteratively across successive stages of instruction-data refinement. We study the quality of HoneyBee's language modeling through automatic evaluation and analyze case studies to further understand the model's capabilities and limitations. Our code and relevant datasets are publicly available at \url{https://github.com/BangLab-UdeM-Mila/NLP4MatSci-HoneyBee}.
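The generate-verify-refine procedure described above can be illustrated with a minimal sketch. It is not the released implementation: the names `Sample`, `curate`, and `progressive_finetune`, the per-dimension score dictionary, the `threshold` value, and the idea that evaluation returns a list of weak topics are all assumptions made for illustration. The Instructor, Verifier, finetuning, and evaluation steps are passed in as plain callables, so no particular LLM API is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# The four quality dimensions named in the abstract.
DIMENSIONS = ("accuracy", "relevance", "completeness", "reasonableness")


@dataclass
class Sample:
    """One instruction-data pair (hypothetical container)."""
    instruction: str
    response: str


# Hypothetical interfaces: the Instructor turns a topic into a sample,
# the Verifier scores a sample on each dimension (assumed range 0..1).
Instructor = Callable[[str], Sample]
Verifier = Callable[[Sample], Dict[str, float]]


def curate(topics: List[str], instructor: Instructor, verifier: Verifier,
           threshold: float = 0.8) -> List[Sample]:
    """Generate one sample per topic and keep those the Verifier accepts."""
    kept = []
    for topic in topics:
        sample = instructor(topic)
        scores = verifier(sample)
        # Assumption: a sample passes only if every dimension meets the threshold.
        if all(scores.get(dim, 0.0) >= threshold for dim in DIMENSIONS):
            kept.append(sample)
    return kept


def progressive_finetune(model: Any, topics: List[str],
                         instructor: Instructor, verifier: Verifier,
                         finetune: Callable[[Any, List[Sample]], Any],
                         evaluate: Callable[[Any], List[str]],
                         rounds: int = 3) -> Any:
    """Finetuning-evaluation-feedback loop over successive rounds."""
    for _ in range(rounds):
        data = curate(topics, instructor, verifier)
        model = finetune(model, data)
        # Assumption: evaluation reports topics where the model is still weak,
        # and those topics drive the next round of instruction generation.
        weak_topics = evaluate(model)
        if not weak_topics:
            break
        topics = weak_topics
    return model
```

Keeping the Verifier separate from the Instructor mirrors the independence the abstract emphasizes: the filter in `curate` trusts only the Verifier's scores, never the generating model's own output.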
