Co-modeling the Sequential and Graphical Route for Peptide

Peptides are formed by the dehydration condensation of multiple amino acids. The primary structure of a peptide can be represented either as an amino acid sequence or as a molecular graph consisting of atoms and chemical bonds. Previous studies have indicated that deep learning routes specific to the sequential and graphical forms of peptides perform comparably on downstream tasks. Although these models learn representations of the same peptide modality, we find that they explain their predictions differently. Treating the sequential and graphical models as two experts that make inferences from different perspectives, we aim to fuse their knowledge to enrich the learned representations and improve discriminative performance. To achieve this, we propose a peptide co-modeling method, RepCon, which employs a contrastive learning-based framework to increase the mutual information between representations from decoupled sequential and graphical end-to-end models. It treats the representations produced by the sequential encoder and the graphical encoder for the same peptide sample as a positive pair, and learns to increase the consistency of representations within positive pairs while repelling representations within negative pairs. Empirical studies of RepCon and other co-modeling methods are conducted on open-source discriminative datasets, covering aggregation propensity, retention time, antimicrobial peptide prediction, and family classification from Peptide Database. Our results demonstrate the superiority of co-modeling over independent modeling, as well as the superiority of RepCon over other methods within the co-modeling framework. In addition, attribution analysis of RepCon further corroborates the validity of the approach at the level of model explanation.
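
The positive-pair and negative-pair objective described above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the code below assumes a symmetric InfoNCE-style formulation: within a batch, the sequence and graph embeddings of the same peptide form the positive pair, and every other cross-pairing acts as a negative. The function names, the `temperature` parameter, and the pure-Python list representation are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, index):
    """Return log-softmax of `logits` evaluated at position `index` (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[index] - log_sum


def contrastive_loss(seq_embs, graph_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed form, not taken from the paper).

    seq_embs[i] and graph_embs[i] are embeddings of the same peptide (positive pair);
    seq_embs[i] and graph_embs[j] with i != j are treated as negative pairs.
    """
    seq = [l2_normalize(v) for v in seq_embs]
    graph = [l2_normalize(v) for v in graph_embs]
    n = len(seq)

    # sim[i][j]: temperature-scaled cosine similarity of sequence i and graph j.
    sim = [[sum(a * b for a, b in zip(s, g)) / temperature for g in graph]
           for s in seq]

    # Sequence -> graph direction: each row should peak on its diagonal entry.
    seq_to_graph = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n

    # Graph -> sequence direction: each column should peak on its diagonal entry.
    graph_to_seq = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    return 0.5 * (seq_to_graph + graph_to_seq)


# Toy batch of two peptides with 2-dimensional embeddings.
seq_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_graph_batch = [[1.0, 0.0], [0.0, 1.0]]   # matches the sequence view
swapped_graph_batch = [[0.0, 1.0], [1.0, 0.0]]   # pairs are mismatched

print(contrastive_loss(seq_batch, aligned_graph_batch))
print(contrastive_loss(seq_batch, swapped_graph_batch))
```

In this toy batch the loss is close to zero when the two views of each peptide agree and large when they are swapped, which is the behaviour a contrastive co-modeling objective is meant to reward. In practice the embeddings would come from trainable sequence and graph encoders, and the loss would be computed with a tensor library rather than Python lists.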
