Efficient catalyst screening requires predictive models of adsorption energy, a key property governing reactivity. However, prevailing methods, notably graph neural networks (GNNs), demand precise atomic coordinates to construct graph representations, and integrating observable attributes into these representations remains challenging. This research introduces CatBERTa, a Transformer model that predicts energies from textual inputs. Built on a pretrained Transformer encoder, CatBERTa processes human-interpretable text that incorporates target features. Attention score analysis reveals that CatBERTa focuses on tokens related to adsorbates, bulk composition, and their interacting atoms. Moreover, interacting atoms emerge as effective descriptors of adsorption configurations, while factors such as bond length and the atomic properties of these atoms contribute little to predictive performance. By predicting adsorption energy from the textual representation of initial structures, CatBERTa achieves a mean absolute error (MAE) of 0.75 eV, comparable to that of vanilla GNNs. Furthermore, subtracting CatBERTa-predicted energies cancels out their systematic errors by as much as 19.3% for chemically similar systems, surpassing the error reduction observed in GNNs. This outcome highlights CatBERTa's potential to improve the accuracy of energy difference predictions. This research establishes a fundamental framework for text-based catalyst property prediction that does not rely on graph representations, while also unveiling intricate feature-property relationships.
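The error cancellation mentioned above can be illustrated with a minimal numerical sketch. It assumes that the prediction errors of two chemically similar systems share a common systematic component plus independent noise. The function names (`mae`, `simulate`), the Gaussian error model, and all standard deviations are hypothetical choices made for illustration; the numbers are synthetic and are not taken from CatBERTa or from any GNN.

```python
import random
import statistics


def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return statistics.fmean([abs(e) for e in errors])


def simulate(n_pairs=10000, shared_sd=0.6, indep_sd=0.4, seed=0):
    """Return (MAE of individual energies, MAE of energy differences).

    Assumed error model (illustrative only): each pair of similar systems
    shares one systematic error term, and each system adds its own
    independent error term.
    """
    rng = random.Random(seed)
    single_errors = []
    diff_errors = []
    for _ in range(n_pairs):
        shared = rng.gauss(0.0, shared_sd)          # systematic part, common to the pair
        err_a = shared + rng.gauss(0.0, indep_sd)   # error on system A
        err_b = shared + rng.gauss(0.0, indep_sd)   # error on system B
        single_errors.extend([err_a, err_b])
        diff_errors.append(err_a - err_b)           # shared part cancels on subtraction
    return mae(single_errors), mae(diff_errors)


if __name__ == "__main__":
    single_mae, diff_mae = simulate()
    print(f"MAE of individual energies: {single_mae:.3f}")
    print(f"MAE of energy differences:  {diff_mae:.3f}")
```

Under this assumed model, the error of an energy difference is smaller than the error of an individual energy only when the shared component is large enough. Setting `shared_sd=0.0` removes the correlation, and the difference then carries a larger error than either individual prediction.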