Tandem mass spectrometry (MS/MS) is the predominant high-throughput technique for analyzing the protein content of biological samples and a cornerstone of proteomics. In recent years, Data-Independent Acquisition (DIA) strategies have advanced substantially, enabling unbiased, untargeted fragmentation of precursor ions. The MS/MS spectra that DIA generates are difficult to interpret because they are highly multiplexed: each spectrum contains product ions from multiple precursor peptides. This is a particularly acute challenge for de novo peptide/protein sequencing, where current methods are ill-equipped to handle multiplexed spectra. In this paper, we introduce DiaTrans, a deep-learning model based on the transformer architecture that deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing state-of-the-art (SOTA) methods, including DeepNovo-DIA and PepNet: DiaTrans improves precision by 15.14% to 34.8% and recall by 11.62% to 31.94% at the amino acid level, and improves precision by 59% to 81.36% at the peptide level. Combining DIA data with the DiaTrans model holds considerable promise for discovering novel peptides and for more comprehensive profiling of biological samples. DiaTrans (Casanovo-DIA) is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.