With the surge in available data from various modalities, there is a growing need to bridge the gap between different data types. In this work, we introduce a novel approach to learning cross-modal representations between image data and molecular representations for drug discovery. We propose EMM and IMM, two loss functions built on top of CLIP that leverage weak supervision and cross-site replicates in High-Content Screening. Evaluating our model against known baselines on cross-modal retrieval, we show that the proposed approach learns better representations and mitigates batch effects. In addition, we present a preprocessing method for the JUMP-CP dataset that reduces the required storage from 85 TB to a usable 7 TB while retaining all perturbations and most of the information content.
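For orientation, the standard CLIP objective that EMM and IMM are built on can be sketched as a symmetric contrastive loss over a batch of paired image and molecule embeddings. This is a minimal illustration of the baseline only, not of EMM or IMM, whose definitions are not given here. The function name `clip_loss`, the argument names `image_emb` and `mol_emb`, and the temperature value of 0.07 are assumptions made for this example.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, mol_emb, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP (baseline sketch).

    Row i of image_emb and row i of mol_emb are assumed to form a
    matched (image, molecule) pair; every other row in the batch is
    treated as a negative.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    mol_emb = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ mol_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-molecule direction: softmax over each row.
    loss_i2m = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Molecule-to-image direction: softmax over each column.
    loss_m2i = -_log_softmax(logits, axis=0)[idx, idx].mean()

    return 0.5 * (loss_i2m + loss_m2i)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is scrambled, which is the property that cross-modal retrieval relies on.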