The advent of artificial intelligence (AI) has enabled a comprehensive exploration of materials for various applications. However, AI models often favor materials that appear frequently in the scientific literature, which limits the selection of suitable candidates based on their inherent physical and chemical properties. To address this imbalance, we have generated a dataset of 1,494,017 natural language-material paragraphs from the combined OQMD, Materials Project, JARVIS, COD and AFLOW2 databases, which are dominated by ab initio calculations and tend to be much more evenly distributed across the periodic table. The generated text narratives were then polled and scored by both human experts and ChatGPT-4 on three rubrics: technical accuracy, language and structure, and relevance and depth of content. The two sets of scores were similar, with human-scored depth of content lagging the most. Merging multimodal data sources with large language models (LLMs) holds immense potential for AI frameworks to aid the exploration and discovery of solid-state materials for specific applications.