Glycans are fundamental biomolecules that perform essential functions in living organisms. The rapid growth of functional glycan data offers a good opportunity for machine learning approaches to glycan understanding. However, a standard machine learning benchmark for glycan property and function prediction is still lacking. In this work, we fill this gap by building a comprehensive benchmark for Glycan Machine Learning (GlycanML). The GlycanML benchmark consists of diverse types of tasks, including glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type prediction, and protein-glycan interaction prediction. Glycans can be represented as both sequences and graphs in GlycanML, which enables us to extensively evaluate sequence-based models and graph neural networks (GNNs) on the benchmark tasks. Furthermore, by performing eight glycan taxonomy prediction tasks concurrently, we introduce the GlycanML-MTL testbed for multi-task learning (MTL) algorithms. We also evaluate how taxonomy prediction can boost the other three function prediction tasks through MTL. Experimental results show the superiority of modeling glycans with multi-relational GNNs, and that suitable MTL methods can further boost model performance. We provide all datasets and source code at https://github.com/GlycanML/GlycanML and maintain a leaderboard at https://GlycanML.github.io/project.