Using machine learning (ML) techniques to predict material properties is a crucial research topic. These properties depend on both numerical data and semantic factors. Because only small-sample datasets are available, existing methods typically either apply ML algorithms to regress numerical properties or transfer knowledge graphs (KGs) pre-trained on other domains to the materials domain. However, these methods cannot handle semantic and numerical information simultaneously. In this paper, we propose a numerical reasoning method for material KGs (NR-KG), which constructs a cross-modal KG from semantic nodes and numerical proxy nodes. It captures both types of information by projecting the cross-modal KG into a canonical KG and uses a graph neural network to predict material properties. As part of this process, we propose a novel projection prediction loss that extracts semantic features from numerical information. NR-KG processes cross-modal data end to end, mines relationships and cross-modal information in small-sample datasets, and makes full use of valuable experimental data to improve material property prediction. We further introduce two new high-entropy alloy (HEA) property datasets with semantic descriptions. NR-KG outperforms state-of-the-art (SOTA) methods, achieving relative improvements of 25.9% and 16.1% on the two material datasets. In addition, NR-KG surpasses SOTA methods on two public physical chemistry molecular datasets, with improvements of 22.2% and 54.3%, highlighting its potential for application and its generalizability. We hope the proposed datasets, algorithms, and pre-trained models can benefit the KG and AI-for-materials communities.
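The abstract does not spell out how the cross-modal KG is built, so the following Python sketch is only a rough illustration of the general idea of storing numerical values behind proxy nodes alongside ordinary semantic nodes. It is not the NR-KG implementation: the class name `CrossModalKG`, the proxy-id naming scheme, and the example alloy facts are all hypothetical, and the projection to a canonical KG, the graph neural network, and the projection prediction loss are not modeled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CrossModalKG:
    """Toy triple store mixing semantic nodes and numerical proxy nodes.

    Illustrative only; not the data structure used by NR-KG.
    """
    triples: List[Tuple[str, str, str]] = field(default_factory=list)
    # Maps each proxy node id to the raw number it stands in for.
    proxy_values: Dict[str, float] = field(default_factory=dict)

    def add_semantic(self, head: str, relation: str, tail: str) -> None:
        """Add an ordinary (head, relation, tail) triple between semantic nodes."""
        self.triples.append((head, relation, tail))

    def add_numeric(self, head: str, relation: str, value: float) -> str:
        """Attach a number to `head` through a dedicated proxy node."""
        # Hypothetical naming scheme: one proxy node per (entity, relation) pair.
        proxy_id = f"num::{head}::{relation}"
        self.proxy_values[proxy_id] = float(value)
        self.triples.append((head, relation, proxy_id))
        return proxy_id

    def neighbors(self, head: str) -> List[Tuple[str, str]]:
        """Return (relation, tail) pairs for every triple starting at `head`."""
        return [(r, t) for h, r, t in self.triples if h == head]


# Made-up example facts about a single alloy.
kg = CrossModalKG()
kg.add_semantic("alloy_1", "processed_by", "arc_melting")
kg.add_semantic("alloy_1", "has_phase", "FCC")
proxy = kg.add_numeric("alloy_1", "fraction_of_Al", 0.2)

print(kg.neighbors("alloy_1"))
print(proxy, kg.proxy_values[proxy])
```

In this sketch, semantic and numerical facts share one triple list, so a graph model could traverse both kinds of node uniformly while the raw numbers remain available through `proxy_values`.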