Context-aware Graph Causality Inference for Few-Shot Molecular Property Prediction

Molecular property prediction is becoming one of the major applications of graph learning in Web-based services, e.g., online protein structure prediction and drug discovery. A key challenge arises in few-shot scenarios, where only a few labeled molecules are available for predicting unseen properties. Several recent studies have used in-context learning to capture relationships among molecules and properties, but they remain limited in two respects: (1) exploiting prior knowledge of functional groups that are causally linked to properties, and (2) identifying key substructures directly correlated with properties. To address these limitations, we propose CaMol, a context-aware graph causality inference framework that takes a causal inference perspective and assumes that each molecule contains a latent causal structure that determines a specific property. First, we introduce a context graph that encodes chemical knowledge by linking functional groups, molecules, and properties to guide the discovery of causal substructures. Second, we propose a learnable atom masking strategy to disentangle causal substructures from confounding ones. Third, we introduce a distribution intervener that applies backdoor adjustment by combining causal substructures with chemically grounded confounders, disentangling causal effects from real-world chemical variations. Experiments on diverse molecular datasets show that CaMol achieves superior accuracy and sample efficiency in few-shot tasks, demonstrating its generalizability to unseen properties. Moreover, the discovered causal substructures align strongly with chemical knowledge about functional groups, supporting the model's interpretability.
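
To make the masking and backdoor-adjustment ideas concrete, the following is a minimal sketch in plain Python, not CaMol's actual implementation. It assumes a soft sigmoid mask over atoms, mean pooling as the readout, a toy linear classifier, and a uniform prior over a small bank of confounder vectors. All function names, weights, and values are hypothetical and chosen only for illustration. The adjustment it approximates is the standard backdoor formula P(Y | do(C)) = sum_s P(Y | C, s) P(s).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def split_by_mask(atom_feats, mask_logits):
    """Soft-split atom features into causal and confounding parts.

    Assumption: each atom gets a mask value m in (0, 1); the causal part
    is m * features and the confounding part is (1 - m) * features.
    """
    causal, confound = [], []
    for feats, logit in zip(atom_feats, mask_logits):
        m = sigmoid(logit)  # learned in a real model; fixed here
        causal.append([m * f for f in feats])
        confound.append([(1.0 - m) * f for f in feats])
    return causal, confound

def readout(atom_feats):
    """Mean-pool atom features into one graph-level vector."""
    n = len(atom_feats)
    dim = len(atom_feats[0])
    return [sum(a[d] for a in atom_feats) / n for d in range(dim)]

def predict(causal_vec, confound_vec, weights):
    """Toy linear classifier on the concatenated representation."""
    z = sum(w * x for w, x in zip(weights, causal_vec + confound_vec))
    return sigmoid(z)

def backdoor_adjusted_prediction(causal_vec, confounder_bank, weights):
    """Approximate P(Y | do(C)) by averaging over confounders s.

    Assumption: P(s) is uniform over the entries of confounder_bank.
    """
    preds = [predict(causal_vec, s, weights) for s in confounder_bank]
    return sum(preds) / len(preds)

# Hypothetical molecule with three atoms and two features per atom.
atoms = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [4.0, -4.0, 0.0]  # atom 0 mostly causal, atom 1 mostly confounding

causal, confound = split_by_mask(atoms, logits)
causal_vec = readout(causal)

# Pair the causal part with its own confounder and two others.
bank = [readout(confound), [0.5, 0.5], [0.0, 0.2]]
weights = [1.0, -0.5, 0.3, 0.3]
p = backdoor_adjusted_prediction(causal_vec, bank, weights)
```

Averaging the prediction over several confounders while the causal part stays fixed is what pushes the classifier to rely on the causal substructure. A trained model would learn the mask logits and classifier weights jointly, rather than fixing them as this sketch does.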
