Molecular representations are inherently task-dependent, yet most pre-trained molecular encoders produce the same representation regardless of the task. Task conditioning promises representations that reorganize according to a task description, but existing approaches rely on expensive labeled data. We show that weak supervision on programmatically derived molecular motifs is sufficient. Our Adaptive Chemical Embedding Model (ACE-Mol) learns from hundreds of motifs paired with natural language descriptors that are cheap to compute and trivial to scale. Whereas conventional encoders slowly search the embedding space for task-relevant structure, ACE-Mol immediately aligns its representations with the task. ACE-Mol achieves state-of-the-art performance across molecular property prediction benchmarks while producing interpretable, chemically meaningful representations.