With the rise of data-driven reaction prediction models, effective reaction descriptors are crucial for bridging the gap between real-world chemistry and its digital representations. However, general-purpose, reaction-level descriptors remain scarce. This study introduces RXNEmb, a novel reaction-level descriptor derived from RXNGraphormer, a model pre-trained to distinguish real reactions from fictitious ones with erroneous bond changes and thereby to learn intrinsic bond formation and cleavage patterns. We demonstrate its utility through a data-driven re-clustering of the USPTO-50k dataset, which yields a classification that reflects bond-change similarities more directly than rule-based categories do. Combined with dimensionality reduction, RXNEmb enables visualization of the diversity of reaction space. Furthermore, analysis of the attention weights shows that the model focuses on chemically critical sites, providing mechanistic insight. RXNEmb thus serves as a powerful, interpretable tool for reaction fingerprinting and analysis, paving the way for more data-centric approaches to reaction discovery.
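To make the re-clustering and visualization workflow concrete, the following is a minimal sketch and not the authors' pipeline. The abstract does not say which clustering or dimensionality-reduction algorithms were used, so plain k-means and PCA are swapped in here as generic stand-ins. The embeddings are synthetic 8-dimensional vectors standing in for RXNEmb descriptors, and the function names `kmeans` and `pca_2d` are hypothetical helpers written for this illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pca_2d(vectors, n_iter=200, seed=0):
    """Project onto the top two principal axes via orthogonal power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(col) / n for col in zip(*vectors)]
    centered = [[x - m for x, m in zip(v, mean)] for v in vectors]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    rng = random.Random(seed)
    axes = []
    for _axis in range(2):
        w = [rng.gauss(0, 1) for _ in range(d)]
        for _step in range(n_iter):
            w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
            for a in axes:  # Gram-Schmidt against axes already found
                proj = sum(x * y for x, y in zip(w, a))
                w = [x - proj * y for x, y in zip(w, a)]
            norm = math.sqrt(sum(x * x for x in w))
            w = [x / norm for x in w]
        axes.append(w)
    coords = [[sum(x * y for x, y in zip(r, a)) for a in axes] for r in centered]
    return coords, axes


# ASSUMPTION: synthetic stand-ins for reaction embeddings, three noisy groups in 8-D.
rng = random.Random(0)
DIM, PER_GROUP = 8, 30
embeddings = []
for g in range(3):
    for _ in range(PER_GROUP):
        embeddings.append(
            [(1.0 if i == g else 0.0) + rng.gauss(0, 0.05) for i in range(DIM)]
        )

labels = kmeans(embeddings, k=3)      # data-driven grouping of the vectors
coords, axes = pca_2d(embeddings)     # 2-D coordinates suitable for a scatter plot
```

With real descriptors, `embeddings` would be replaced by one vector per reaction, and `coords` could be plotted and colored by `labels` to compare the data-driven clusters against rule-based reaction classes.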