Co-modeling the Sequential and Graphical Routes for Peptide Representation Learning

Peptides are formed by the dehydration condensation of multiple amino acids. The primary structure of a peptide can be represented either as an amino acid sequence or as a molecular graph of atoms and chemical bonds. Previous studies have indicated that deep learning routes specific to the sequential and graphical forms of peptides perform comparably on downstream tasks. Although these models learn representations of the same modality, we find that they explain their predictions differently. Treating the sequential and graphical models as two experts that make inferences from different perspectives, we fuse their knowledge to enrich the learned representations and improve discriminative performance. To this end, we propose RepCon, a peptide co-modeling method that uses a contrastive learning-based framework to enhance the mutual information between representations from decoupled sequential and graphical end-to-end models. RepCon treats the representations that the sequential encoder and the graphical encoder produce for the same peptide sample as a positive pair, and learns to increase the consistency of representations within positive pairs while repelling representations in negative pairs. We conduct empirical studies of RepCon and other co-modeling methods on open-source discriminative datasets, covering aggregation propensity, retention time, antimicrobial peptide prediction, and family classification from Peptide Database. The results demonstrate that co-modeling outperforms independent modeling, and that RepCon outperforms other methods within the co-modeling framework. In addition, attribution analysis of RepCon further corroborates the validity of the approach at the level of model explanation.
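
The positive-pair and negative-pair objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so the code below assumes a symmetric InfoNCE-style objective with in-batch negatives: row i of `seq_repr` and row i of `graph_repr` are taken to be the two encoders' outputs for the same peptide, and every other row in the batch serves as a negative. The function name `cross_view_info_nce`, the `temperature` value, and the use of cosine similarity are illustrative assumptions, not details taken from RepCon.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_view_info_nce(seq_repr, graph_repr, temperature=0.1):
    """Symmetric InfoNCE-style loss between two views of the same batch.

    Assumption (not from the source): row i of both inputs describes the
    same peptide (positive pair); all other rows act as in-batch negatives.
    """
    z_seq = l2_normalize(np.asarray(seq_repr, dtype=float))
    z_graph = l2_normalize(np.asarray(graph_repr, dtype=float))

    # logits[i, j] = similarity of sequence view i and graph view j.
    logits = z_seq @ z_graph.T / temperature
    idx = np.arange(logits.shape[0])

    # Sequence-to-graph: each row should put its mass on the diagonal entry.
    loss_s2g = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Graph-to-sequence: the same requirement, applied column-wise.
    loss_g2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2g + loss_g2s)
```

Under these assumptions, the loss decreases as the two encoders agree on each peptide and increases when a peptide's sequence representation is closer to another peptide's graph representation. In practice such a term would be combined with the supervised loss of each downstream task, and the weighting between the two is a design choice the abstract leaves open.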
