Large language models (LLMs) are increasingly being used in materials science. However, little attention has been given to benchmarking and standardized evaluation for LLM-based materials property prediction, which hinders progress. We present LLM4Mat-Bench, the largest benchmark to date for evaluating the performance of LLMs in predicting the properties of crystalline materials. LLM4Mat-Bench contains about 1.9M crystal structures in total, collected from 10 publicly available materials data sources, and covers 45 distinct properties. LLM4Mat-Bench features three input modalities: crystal composition, CIF, and crystal text description, with 4.7M, 615.5M, and 3.1B tokens in total for each modality, respectively. We use LLM4Mat-Bench to fine-tune models of different sizes, including LLM-Prop and MatBERT, and provide zero-shot and few-shot prompts to evaluate the property prediction capabilities of chat-style LLMs, including Llama, Gemma, and Mistral. The results highlight the challenges general-purpose LLMs face in materials science and the need for task-specific predictive models and task-specific instruction-tuned LLMs for materials property prediction.
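To make the zero-shot evaluation setup concrete, below is a minimal sketch of how a single-turn prompt for a chat-style LLM might be formatted and its reply parsed when predicting a numeric property from a composition string. The prompt wording, property name, and parsing logic are illustrative assumptions, not the benchmark's exact templates.

```python
import re


def build_zero_shot_prompt(composition: str, property_name: str, unit: str) -> str:
    """Compose a single-turn prompt asking for a numeric property value.

    This is a hypothetical template for illustration; LLM4Mat-Bench's
    actual prompts may differ in wording and structure.
    """
    return (
        "You are a materials science assistant.\n"
        f"Given the crystal composition {composition}, predict its "
        f"{property_name} in {unit}. Answer with a single number only."
    )


def parse_numeric_answer(text: str) -> float | None:
    """Extract the first number from the model's reply, if any."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


if __name__ == "__main__":
    prompt = build_zero_shot_prompt("SrTiO3", "band gap", "eV")
    print(prompt)
    # A real run would send `prompt` to a chat LLM such as Llama, Gemma,
    # or Mistral and pass the reply to parse_numeric_answer(); here we
    # only demonstrate the parsing step on a sample reply.
    print(parse_numeric_answer("The band gap is approximately 3.2 eV."))
```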