Cross-modal text-molecule retrieval models aim to learn a shared feature space of the text and molecule modalities for accurate similarity calculation, which facilitates the rapid screening of molecules with specific properties and activities in drug design. However, previous works have two main defects. First, they are inadequate in capturing modality-shared features considering the significant gap between text sequences and molecule graphs. Second, they mainly rely on contrastive learning and adversarial training for cross-modal alignment, both of which focus on the first-order similarity, ignoring the second-order similarity that can capture more structural information in the embedding space. To address these issues, we propose a novel cross-modal text-molecule retrieval model with two-fold improvements. Specifically, on top of two modality-specific encoders, we stack a memory-bank-based feature projector that contains learnable memory vectors to better extract modality-shared features. More importantly, during model training, we calculate four kinds of similarity distributions (text-to-text, text-to-molecule, molecule-to-molecule, and molecule-to-text similarity distributions) for each instance, and then minimize the distance between these similarity distributions (namely, second-order similarity losses) to enhance cross-modal alignment. Experimental results and analysis strongly demonstrate the effectiveness of our model. In particular, our model achieves state-of-the-art performance, outperforming the previously reported best result by 6.4%.
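The second-order similarity idea above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: it assumes batch-wise cosine similarities, a softmax temperature, and KL divergence as the distance between distributions; the function and parameter names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q), computed row-wise and averaged over the batch
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def second_order_similarity_loss(text_emb, mol_emb, temperature=0.1):
    """Hypothetical sketch of a second-order similarity loss:
    build four batch-wise similarity distributions per anchor and
    pull each cross-modal distribution toward its intra-modal one."""
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)

    # four kinds of similarity distributions over the batch
    s_tt = softmax(t @ t.T / temperature)  # text-to-text
    s_tm = softmax(t @ m.T / temperature)  # text-to-molecule
    s_mm = softmax(m @ m.T / temperature)  # molecule-to-molecule
    s_mt = softmax(m @ t.T / temperature)  # molecule-to-text

    # minimize the distance between cross-modal and intra-modal distributions
    return kl_divergence(s_tt, s_tm) + kl_divergence(s_mm, s_mt)
```

Unlike a first-order contrastive loss, which only pulls each text embedding toward its paired molecule embedding, this objective matches whole rows of the similarity matrices, so the relative structure of each anchor's neighborhood is aligned across modalities.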