Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry

From arXiv: Ebrahimi S., Guo X. Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry. In 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE) 2022 Dec 6 (pp. 17-22). IEEE.

Tandem mass spectrometry (MS/MS) is the predominant high-throughput technique for comprehensively analyzing the protein content of biological samples and a cornerstone of proteomics. In recent years, substantial progress has been made in data-independent acquisition (DIA) strategies, which enable unbiased, non-targeted fragmentation of precursor ions. DIA-generated MS/MS spectra are difficult to interpret because they are highly multiplexed: each spectrum contains fragment ions originating from multiple precursor peptides. This is a particularly acute challenge for de novo peptide/protein sequencing, where current methods are ill-equipped to handle multiplexed spectra. In this paper, we introduce DiaTrans, a deep-learning model based on the transformer architecture that deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing state-of-the-art (SOTA) methods, including DeepNovo-DIA and PepNet. DiaTrans improves precision by 15.14% to 34.8% and recall by 11.62% to 31.94% at the amino acid level, and improves precision by 59% to 81.36% at the peptide level. Integrating DIA data with the DiaTrans model holds considerable promise for discovering novel peptides and for more comprehensive profiling of biological samples. DiaTrans is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.
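The abstract reports precision and recall at the amino acid level and precision at the peptide level without defining them. As a rough illustration only, the sketch below computes these three quantities under simplifying assumptions that are not taken from the paper: residues are compared position by position as plain characters (published evaluations typically match residues by mass within a tolerance), every prediction is counted (no confidence threshold), and the function names and example peptides are hypothetical.

```python
def aa_matches(predicted, reference):
    """Count residues that agree at the same position (simplified matching)."""
    return sum(p == r for p, r in zip(predicted, reference))


def evaluate(pairs):
    """Compute simplified metrics over non-empty (predicted, reference) pairs."""
    matched = predicted_total = reference_total = exact = 0
    for predicted, reference in pairs:
        matched += aa_matches(predicted, reference)
        predicted_total += len(predicted)
        reference_total += len(reference)
        exact += predicted == reference  # True counts as 1
    return {
        # Correct residues as a fraction of all predicted residues.
        "aa_precision": matched / predicted_total,
        # Correct residues as a fraction of all reference residues.
        "aa_recall": matched / reference_total,
        # Fully correct peptides as a fraction of all predictions made.
        "peptide_precision": exact / len(pairs),
    }


# Hypothetical predictions paired with hypothetical reference sequences.
pairs = [
    ("PEPTIDE", "PEPTIDE"),  # fully correct
    ("PEPTADE", "PEPTIDE"),  # one wrong residue
    ("ACDK", "ACDEK"),       # one residue missing
]
metrics = evaluate(pairs)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because a single wrong residue makes the whole peptide count as incorrect, peptide-level precision is a much stricter measure than amino-acid-level precision, which is one reason the two are usually reported separately.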

