Tandem mass spectrometry has played a pivotal role in advancing proteomics by enabling the analysis of protein composition in biological samples. Although various deep learning methods have been developed to identify the amino acid sequences (peptides) responsible for observed spectra, challenges persist in \emph{de novo} peptide sequencing. First, prior methods struggle to identify amino acids with post-translational modifications (PTMs) because these occur less frequently in training data than canonical amino acids, which in turn lowers peptide-level identification precision. Second, diverse types of noise and missing peaks in mass spectra reduce the reliability of the training data (peptide-spectrum matches, PSMs). To address these challenges, we propose AdaNovo, a novel framework that calculates the conditional mutual information (CMI) between the spectrum and each amino acid/peptide and uses the CMI for adaptive model training. Extensive experiments demonstrate AdaNovo's state-of-the-art performance on a 9-species benchmark in which the peptides in the training set are almost completely disjoint from those in the test sets. Moreover, AdaNovo excels at identifying amino acids with PTMs and is robust to data noise. The supplementary materials contain the official code.
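To make the idea of CMI-guided training concrete, the following is a minimal sketch and not the official AdaNovo code. It assumes that the per-amino-acid CMI can be estimated as the log-ratio between the probability a model assigns to the correct amino acid when it sees the spectrum and the probability it assigns from the preceding amino acids alone. The function names (`pointwise_cmi`, `adaptive_weights`, `weighted_nll`), the standardize-and-clip reweighting rule, and the `scale` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def pointwise_cmi(p_with_spectrum, p_without_spectrum, eps=1e-12):
    """Per-position CMI estimate (an assumed form, not taken from the paper):
    log p(y_j | spectrum, prefix) - log p(y_j | prefix)."""
    p_with = np.asarray(p_with_spectrum, dtype=float)
    p_without = np.asarray(p_without_spectrum, dtype=float)
    return np.log(p_with + eps) - np.log(p_without + eps)


def adaptive_weights(cmi, scale=1.0):
    """Hypothetical reweighting rule: standardize the CMI values across
    positions, shift by 1, and clip at 0 so that weights are non-negative."""
    cmi = np.asarray(cmi, dtype=float)
    std = cmi.std()
    if std == 0.0:
        # All positions equally informative: fall back to uniform weights.
        return np.ones_like(cmi)
    z = (cmi - cmi.mean()) / std
    return np.maximum(0.0, 1.0 + scale * z)


def weighted_nll(p_with_spectrum, weights, eps=1e-12):
    """Weighted negative log-likelihood of the target amino acids."""
    p_with = np.asarray(p_with_spectrum, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(-np.mean(weights * np.log(p_with + eps)))


# Toy probabilities of the correct amino acid at three positions.
p_with = [0.9, 0.5, 0.2]     # model conditioned on the spectrum
p_without = [0.3, 0.5, 0.4]  # model conditioned on the prefix only

cmi = pointwise_cmi(p_with, p_without)
weights = adaptive_weights(cmi)
loss = weighted_nll(p_with, weights)
```

Under these assumptions, positions whose prediction depends strongly on the spectrum receive larger weights, while positions where the spectrum adds little or even hurts are down-weighted. This is one plausible way to emphasize spectrum-dependent amino acids and to discount unreliable PSMs.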