ChemCLIP: Bridging Organic and Inorganic Anticancer Compounds Through Contrastive Learning

Anticancer drug discovery has traditionally treated organic small molecules and metal-based coordination complexes as separate chemical domains, which limits knowledge transfer despite their shared biological objectives. The divide is especially pronounced in the available data: organic compounds are covered by extensive screening databases, whereas only a few thousand metal complexes have been characterized. Here, we introduce ChemCLIP, a dual-encoder contrastive learning framework that bridges this organic-inorganic divide by learning unified representations based on shared anticancer activity rather than structural similarity. We compiled complementary datasets comprising 44,854 unique organic compounds and 5,164 unique metal complexes, standardized across 60 cancer cell lines. By training parallel encoders with activity-aware hard negative mining, we mapped structurally distinct compounds into a shared 256-dimensional embedding space in which biologically similar compounds cluster together regardless of chemical class. We systematically evaluated four molecular encoding strategies (Morgan fingerprints, ChemBERTa, MolFormer, and Chemprop) using quantitative alignment metrics, embedding visualizations, and downstream classification tasks. Morgan fingerprints performed best, with an average alignment ratio of 0.899 and downstream classification AUCs of 0.859 (inorganic) and 0.817 (organic). This work establishes contrastive learning as an effective strategy for unifying disparate chemical domains and provides empirical guidance for encoder selection in multi-modal chemistry applications, with implications that extend beyond anticancer drug discovery to any setting that requires cross-domain chemical knowledge transfer.
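To make the training objective more concrete, the following is a minimal sketch of a contrastive loss with activity-aware hard negative selection. It is an illustration under stated assumptions, not the ChemCLIP implementation: the abstract does not specify the loss, the mining rule, or the thresholds. The InfoNCE-style loss, the cosine activity similarity, the `act_threshold` cutoff, and all function names are hypothetical choices. Embeddings are assumed to come from the two encoders and are shown here as toy 2-dimensional vectors rather than 256-dimensional ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_negatives(anchor_emb, anchor_act, cand_embs, cand_acts,
                   k=2, act_threshold=0.5):
    """Assumed mining rule: keep candidates whose activity profile is
    dissimilar to the anchor's (cosine < act_threshold), then return the
    indices of the k that lie closest to the anchor in embedding space."""
    dissimilar = [j for j, act in enumerate(cand_acts)
                  if cosine(anchor_act, act) < act_threshold]
    dissimilar.sort(key=lambda j: cosine(anchor_emb, cand_embs[j]),
                    reverse=True)
    return dissimilar[:k]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive pair's
    similarity against the similarities of the selected negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Toy data: one organic anchor and four metal-complex candidates.
# Activity profiles stand in for per-cell-line responses (3 lines, not 60).
organic_emb = [1.0, 0.0]
organic_act = [1.0, 0.0, 1.0]
complex_embs = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
complex_acts = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0],
                [0.1, 1.0, 0.0], [0.0, 1.0, 0.1]]

# Candidate 0 shares the anchor's activity profile, so it acts as the positive.
negative_ids = hard_negatives(organic_emb, organic_act,
                              complex_embs, complex_acts)
loss = info_nce(organic_emb, complex_embs[0],
                [complex_embs[j] for j in negative_ids])
print(negative_ids, loss)
```

In this sketch a negative counts as hard when it sits close to the anchor in embedding space despite having a different activity profile, so the loss pushes apart exactly those pairs that structure alone would confuse.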
