Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which treat such structural changes as a problem to be avoided. However, this trend overlooks a fundamental insight: structural changes arising from biologically meaningful perturbations are not a problem to be avoided but a rich source of information. It thereby misses the opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments and using the latter as explicit supervision. On patient-derived GRNs from three cancer types, we train GRN representations with SupGCL and evaluate them in two regimes: (i) embedding space analysis, where SupGCL yields clearer disease-subtype structure and improves clustering, and (ii) task-specific fine-tuning, where it consistently outperforms strong graph representation learning baselines on 13 downstream tasks spanning gene-level functional annotation and patient-level prediction.