Exploring the predictive capabilities of language models in materials science is an area of ongoing interest. This study investigates the application of language model embeddings to enhance material property prediction. By evaluating various contextual embedding methods and pre-trained models, including Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT), we demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties. Our findings reveal that information-dense embeddings from the third layer of MatBERT, combined with a context-averaging approach, offer the most effective method for capturing material-property relationships from the scientific literature. We also identify a crucial "tokenizer effect," highlighting the importance of specialized text processing techniques that preserve complete compound names while maintaining consistent token counts. These insights underscore the value of domain-specific training and tokenization in materials science applications and offer a promising pathway for accelerating the discovery and development of new materials through AI-driven approaches.
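To make the layer-3 embedding extraction and context-averaging idea concrete, the following is a minimal sketch, assuming a Hugging Face-style MatBERT checkpoint (the checkpoint name, the helper functions, and the example sentences are illustrative assumptions, not the authors' exact pipeline):

```python
# Minimal sketch (assumptions labeled): pull layer-3 hidden states from a MatBERT-style
# checkpoint and average the compound's token embeddings across context sentences.
# MODEL_NAME is an assumed identifier; substitute the checkpoint you actually have.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "lbnlp/matbert-base-cased"  # hypothetical checkpoint name; verify locally

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def layer3_embedding(sentence: str, compound: str) -> torch.Tensor:
    """Mean-pool layer-3 hidden states over the tokens covering `compound` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]  # model.forward does not accept offsets
    with torch.no_grad():
        hidden = model(**enc).hidden_states[3]  # shape: [1, seq_len, hidden_dim]
    # Keep only tokens whose character span lies inside the compound mention.
    start = sentence.find(compound)
    end = start + len(compound)
    keep = torch.tensor(
        [s >= start and e <= end and e > s for s, e in offsets.tolist()],
        dtype=torch.bool,
    )
    return hidden[0, keep].mean(dim=0)

def context_averaged_embedding(sentences: list[str], compound: str) -> torch.Tensor:
    """Context averaging: mean of the compound embedding over all sentences that mention it."""
    vecs = [layer3_embedding(s, compound) for s in sentences if compound in s]
    return torch.stack(vecs).mean(dim=0)

# Hypothetical usage; in the study the contexts would come from the literature corpus.
contexts = [
    "SrTiO3 exhibits a high dielectric constant at room temperature.",
    "Thin films of SrTiO3 were grown by pulsed laser deposition.",
]
vector = context_averaged_embedding(contexts, "SrTiO3")
print(vector.shape)  # torch.Size([768]) for a BERT-base-sized model
```

The span-masking step is one plausible way to respect the "tokenizer effect" noted above: it keeps all sub-word pieces of a compound name together before pooling, rather than letting the tokenizer split the formula and averaging over unrelated tokens.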