Pre-trained language models have emerged as promising tools for predicting molecular properties, yet their development is still in its early stages, and further research is needed to improve their efficacy and to address challenges such as generalization and sample efficiency. In this paper, we present a multi-view approach that combines latent spaces derived from state-of-the-art chemical models. Our approach relies on two pivotal elements: the embeddings derived from MHG-GNN, which represent molecular structures as graphs, and MoLFormer embeddings rooted in chemical language. The attention mechanism of MoLFormer can identify relations between two atoms even when they are far apart, while the GNN of MHG-GNN can more precisely capture relations among multiple closely located atoms. We demonstrate the superior performance of our proposed multi-view approach compared to existing state-of-the-art methods, including MoLFormer-XL, which was trained on 1.1 billion molecules, particularly in intricate tasks such as predicting clinical trial drug toxicity and inhibition of HIV replication. We assessed our approach on six benchmark datasets from MoleculeNet, where it outperformed competitors on five of them. Our study highlights the potential of latent space fusion and feature integration for advancing molecular property prediction. We use small versions of MHG-GNN and MoLFormer, which leaves room for further improvement when our approach uses a larger-scale dataset.
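As a rough illustration of what fusing two latent spaces can look like, the sketch below standardizes two per-molecule embedding matrices and concatenates them along the feature axis. The abstract does not specify the fusion operator, so concatenation is an assumption here, as are the function name `fuse_views`, the embedding dimensions, and the random arrays that stand in for real MHG-GNN and MoLFormer outputs.

```python
import numpy as np


def fuse_views(graph_emb, text_emb):
    """Standardize each view per feature, then concatenate along the feature axis.

    Hypothetical helper: concatenation is an assumed fusion operator,
    not necessarily the one used in the paper.
    """
    if graph_emb.shape[0] != text_emb.shape[0]:
        raise ValueError("both views must describe the same molecules")

    def standardize(x):
        mu = x.mean(axis=0, keepdims=True)
        sigma = x.std(axis=0, keepdims=True)
        # Guard against constant features to avoid division by zero.
        return (x - mu) / np.where(sigma > 0, sigma, 1.0)

    return np.concatenate([standardize(graph_emb), standardize(text_emb)], axis=1)


# Synthetic placeholders, not real model outputs.
rng = np.random.default_rng(0)
n_molecules = 8
graph_emb = rng.normal(size=(n_molecules, 16))  # stand-in for graph-based embeddings
text_emb = rng.normal(size=(n_molecules, 12))   # stand-in for chemical-language embeddings

fused = fuse_views(graph_emb, text_emb)
print(fused.shape)  # (8, 28)
```

The fused matrix would then be passed to a downstream property predictor. In a real pipeline the standardization statistics should be computed on the training split only and reused for validation and test data.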