Contextual Molecule Representation Learning from Chemical Reaction Knowledge

In recent years, self-supervised learning has emerged as a powerful tool for harnessing abundant unlabelled data for representation learning and has been broadly adopted across diverse areas. However, when applied to molecular representation learning (MRL), prevailing techniques such as masked sub-unit reconstruction often fall short: the high degree of freedom in the possible combinations of atoms within molecules brings insurmountable complexity to the masking-reconstruction paradigm. To tackle this challenge, we introduce REMO, a self-supervised learning framework that takes advantage of well-defined atom-combination rules in common chemistry. Specifically, REMO pre-trains graph/Transformer encoders on 1.7 million known chemical reactions from the literature. We propose two pre-training objectives: Masked Reaction Centre Reconstruction (MRCR) and Reaction Centre Identification (RCI). REMO offers a novel solution to MRL by exploiting the underlying shared patterns in chemical reactions as context for pre-training, which allows it to learn meaningful representations of common chemistry knowledge. Such contextual representations can then support diverse downstream molecular tasks, such as affinity prediction and drug-drug interaction prediction, with minimal fine-tuning. Extensive experimental results on MoleculeACE, ACNet, drug-drug interaction (DDI), and reaction type classification show that, across all tested downstream tasks, REMO outperforms the standard baseline of single-molecule masked modeling used in current MRL. Remarkably, REMO is the first deep learning model to surpass fingerprint-based methods on activity cliff benchmarks.
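To make the two objectives concrete, the following is a minimal sketch of how training targets for MRCR and RCI might be constructed, assuming that the reaction-centre atom indices of a molecule are already known. The function names, the `[MASK]` token, and the flat list-of-atom-symbols representation are all hypothetical simplifications introduced for illustration; they are not taken from REMO, which operates on graph/Transformer encoders and uses the other reaction participants as context (omitted here).

```python
from typing import Dict, List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask token, chosen for this sketch only


def build_mrcr_example(
    atoms: Sequence[str], centre: Sequence[int]
) -> Tuple[List[str], Dict[int, str]]:
    """MRCR-style example: hide reaction-centre atoms, keep them as targets."""
    centre_set = set(centre)
    # Replace every reaction-centre atom with the mask token.
    inputs = [MASK if i in centre_set else a for i, a in enumerate(atoms)]
    # The model would be trained to recover the original atom at each masked index.
    targets = {i: atoms[i] for i in sorted(centre_set)}
    return inputs, targets


def build_rci_labels(atoms: Sequence[str], centre: Sequence[int]) -> List[int]:
    """RCI-style labels: 1 for atoms in the reaction centre, 0 otherwise."""
    centre_set = set(centre)
    return [1 if i in centre_set else 0 for i in range(len(atoms))]


# Toy input: heavy atoms of acetic acid, with atoms 1 and 3 marked
# (by assumption, for illustration) as the reaction centre.
atoms = ["C", "C", "O", "O"]
centre = [1, 3]

mrcr_inputs, mrcr_targets = build_mrcr_example(atoms, centre)
rci_labels = build_rci_labels(atoms, centre)
```

In this sketch MRCR is framed as reconstruction of hidden atoms and RCI as per-atom binary classification; how REMO actually parameterizes and combines the two losses is not specified in the abstract.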
