This paper addresses three problems in protein classification. First, classifying carbohydrate-active enzymes (CAZymes) helps in understanding enzyme properties, but a single CAZyme may belong to several classes, which makes the task multi-label CAZyme classification. Second, to capture information from protein secondary structure, protein classification is modeled as a graph classification problem. Third, compound-protein interaction prediction combines graph learning for the compound with a sequential embedding for the protein, and can be treated as a classification task over compound-protein pairs. The paper proposes one model for each problem: (1) a multi-label CAZyme classification model using a CNN-LSTM with an attention mechanism; (2) a subspace learning model based on a variational graph autoencoder for protein graph classification; and (3) a model that combines graph isomorphism networks (GIN) with an attention-based CNN-LSTM for compound-protein interaction prediction, together with a comparison of GIN against graph convolutional networks (GCN) and graph attention networks (GAT) on this task. The proposed models are effective for protein classification. Source code and data are available at https://github.com/zshicode/GNN-AttCL-protein. The repository also collects and collates benchmark datasets for these and related problems, including CAZyme classification, enzyme protein graph classification, compound-protein interaction prediction, drug-target affinity prediction, and drug-drug interaction prediction, which makes evaluation on these benchmarks more convenient.
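To illustrate why the multi-label setting differs from ordinary single-label classification, the following is a minimal sketch and not the paper's implementation. It assumes that a network has already produced one raw score (logit) per class, that each class is decided independently through a sigmoid and a threshold, and that training uses a per-class binary cross-entropy. The class list uses the six standard CAZy class abbreviations purely as an illustration; the label set, the threshold of 0.5, and the function names `predict_labels` and `binary_cross_entropy` are all assumptions.

```python
import math

# Illustrative label set (assumption): the six standard CAZy class abbreviations.
CLASSES = ["GH", "GT", "PL", "CE", "AA", "CBM"]


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every class whose independent sigmoid probability reaches the threshold.

    Unlike an argmax over a softmax, this can return zero, one, or several classes.
    """
    return [name for name, z in zip(CLASSES, logits) if sigmoid(z) >= threshold]


def binary_cross_entropy(logits, targets):
    """Mean per-class binary cross-entropy, a common multi-label training loss."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(logits)


if __name__ == "__main__":
    # Hypothetical logits for one enzyme; several classes can pass the threshold.
    logits = [2.0, -1.5, -3.0, 0.4, -2.2, 1.1]
    print(predict_labels(logits))
    print(binary_cross_entropy(logits, [1, 0, 0, 1, 0, 1]))
```

Because each class has its own sigmoid, the predicted set can contain several classes for a single enzyme, which a softmax with a single argmax cannot express.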