Compound identification and structure annotation from mass spectra is a well-established task, widely applied in drug detection, criminal forensics, small-molecule biomarker discovery, and chemical engineering. We propose SpecTUS: Spectral Translator for Unknown Structures, a deep neural model that addresses the structural annotation of small molecules from low-resolution gas chromatography electron ionization mass spectra (GC-EI-MS). Our model analyzes the spectra in a \textit{de novo} manner: it translates each spectrum directly into a 2D structural representation. This approach is particularly useful for analyzing compounds that are unavailable in spectral libraries. In a rigorous evaluation on the novel structure annotation task across different libraries, our model outperformed standard database search techniques by a wide margin. On a held-out test set that includes \numprint{28267} spectra from the NIST database, we show that our model's single suggestion perfectly reconstructs 43\% of the subset's compounds. This single suggestion is strictly better than the candidate returned by the database hybrid search (a common method among practitioners) in 76\% of cases. In a~still affordable scenario of~10 suggestions, perfect reconstruction is achieved in 65\% of cases, and the suggestions are better than the hybrid search in 84\% of cases.