Graph Self-Supervised Learning (GSSL) provides a way to learn embeddings without expert labelling, which is especially valuable for molecular graphs, where the number of possible molecules is vast and labels are expensive to obtain. However, GSSL methods are designed not for optimisation within a specific domain but for transferability across a variety of downstream tasks, and this broad applicability complicates their evaluation. To address this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), which generates detailed profiles of molecular graph embeddings using interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph properties, (ii) molecular substructure properties, and (iii) embedding space properties. Using MOLGRAPHEVAL to benchmark existing GSSL methods on both current downstream datasets and our suite of probing tasks, we find significant inconsistencies between the conclusions drawn from existing datasets alone and those derived from more fine-grained probing. These findings suggest that current evaluation methodologies fail to capture the full picture of what molecular graph embeddings encode.
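As a rough illustration of what a probing task can look like, the sketch below fits a simple linear probe on frozen embeddings and reports how well a graph-level property can be read out of them. It is not taken from MOLGRAPHEVAL: the function names `fit_linear_probe` and `probe_r2`, the ridge-regression probe, and the synthetic embeddings and properties are all assumptions made for illustration, standing in for the output of a pretrained GSSL encoder and for real graph or substructure attributes.

```python
import numpy as np


def _with_bias(embeddings):
    """Append a constant column so the probe can learn an intercept."""
    return np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])


def fit_linear_probe(embeddings, targets, l2=1e-3):
    """Fit a ridge-regression probe on frozen embeddings (hypothetical helper).

    The encoder is never updated; only the probe weights are learned.
    """
    X = _with_bias(embeddings)
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)


def probe_r2(weights, embeddings, targets):
    """Coefficient of determination of the probe on held-out graphs."""
    pred = _with_bias(embeddings) @ weights
    ss_res = np.sum((targets - pred) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for embeddings from a pretrained encoder (assumption).
rng = np.random.default_rng(0)
n_graphs, dim = 400, 16
embeddings = rng.normal(size=(n_graphs, dim))

# One property that is linearly encoded in the embeddings (plus noise),
# and one that is unrelated to them. Both are made up for this sketch.
true_w = rng.normal(size=dim)
encoded_property = embeddings @ true_w + 0.1 * rng.normal(size=n_graphs)
unrelated_property = rng.normal(size=n_graphs)

train, test = slice(0, 300), slice(300, None)

w_enc = fit_linear_probe(embeddings[train], encoded_property[train])
r2_encoded = probe_r2(w_enc, embeddings[test], encoded_property[test])

w_unrel = fit_linear_probe(embeddings[train], unrelated_property[train])
r2_unrelated = probe_r2(w_unrel, embeddings[test], unrelated_property[test])

print(f"R^2, encoded property:   {r2_encoded:.3f}")
print(f"R^2, unrelated property: {r2_unrelated:.3f}")
```

A high held-out score suggests the property is readily recoverable from the embedding, while a score near or below zero suggests it is not, at least not linearly. Repeating this over many properties yields the kind of per-attribute profile the abstract describes, although the actual probe architecture and metrics used by MOLGRAPHEVAL may differ.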