Tandem mass spectrometry (MS/MS) is the predominant high-throughput technique for comprehensively analyzing the protein content of biological samples and a cornerstone of proteomics. In recent years, Data-Independent Acquisition (DIA) strategies have advanced substantially, enabling unbiased, untargeted fragmentation of precursor ions. The MS/MS spectra that DIA generates are difficult to interpret because they are highly multiplexed: each spectrum contains product ions from multiple precursor peptides. This poses a particularly acute challenge for de novo peptide/protein sequencing, where current methods are ill-equipped to handle multiplexed spectra. In this paper, we introduce DiaTrans (also referred to as Casanovo-DIA), a deep-learning model based on the transformer architecture that deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing state-of-the-art (SOTA) methods, including DeepNovo-DIA and PepNet: DiaTrans improves precision by 15.14% to 34.8% and recall by 11.62% to 31.94% at the amino acid level, and improves precision by 59% to 81.36% at the peptide level. Combining DIA data with our DiaTrans model holds considerable promise for discovering novel peptides and profiling biological samples more comprehensively. DiaTrans is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.
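The abstract reports precision and recall at the amino acid level and precision at the peptide level. As a rough illustration of how such metrics are typically defined, the sketch below computes them from pairs of predicted and reference peptides. It is not the authors' evaluation code: the function names are hypothetical, and residues are compared by exact position-wise identity, which is a simplifying assumption (published de novo sequencing evaluations often match residues by mass within a tolerance instead).

```python
from typing import Dict, List, Sequence, Tuple


def count_matching_residues(predicted: Sequence[str], reference: Sequence[str]) -> int:
    """Count positions where predicted and reference residues are identical.

    Simplifying assumption: exact, position-wise comparison rather than
    mass-tolerance matching.
    """
    return sum(1 for p, r in zip(predicted, reference) if p == r)


def evaluate(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Compute amino-acid and peptide-level metrics.

    Each pair is (predicted peptide, reference peptide). An empty predicted
    string means the model made no prediction for that spectrum.
    """
    matched_aa = 0          # residues predicted correctly
    predicted_aa = 0        # residues in all predictions
    reference_aa = 0        # residues in all reference peptides
    correct_peptides = 0    # predictions identical to the reference
    predicted_peptides = 0  # spectra with a non-empty prediction

    for predicted, reference in pairs:
        reference_aa += len(reference)
        if not predicted:
            continue
        predicted_peptides += 1
        predicted_aa += len(predicted)
        matched_aa += count_matching_residues(predicted, reference)
        if predicted == reference:
            correct_peptides += 1

    return {
        "aa_precision": matched_aa / predicted_aa if predicted_aa else 0.0,
        "aa_recall": matched_aa / reference_aa if reference_aa else 0.0,
        "peptide_precision": (
            correct_peptides / predicted_peptides if predicted_peptides else 0.0
        ),
    }


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    toy_pairs = [
        ("PEPTIDE", "PEPTIDE"),  # fully correct
        ("PEPTIDK", "PEPTIDE"),  # one wrong residue
        ("", "SEQ"),             # no prediction made
    ]
    for name, value in evaluate(toy_pairs).items():
        print(f"{name}: {value:.3f}")
```

Under these definitions, a spectrum with no prediction lowers amino-acid recall (its reference residues still count) but does not affect either precision value, which is why precision and recall can differ noticeably for the same model.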