Mass spectrometry is a powerful and widely used tool for identifying molecular structures, owing to its sensitivity and its ability to profile complex samples. However, translating spectra into full molecular structures is a difficult, underdetermined inverse problem. Solving it is crucial for enabling biological insight, discovering new metabolites, and advancing chemical research across multiple fields. To this end, we develop MSFlow, a two-stage encoder-decoder flow-matching generative model that achieves state-of-the-art performance on the structure elucidation task for small molecules. In the first stage, we adopt a formula-restricted transformer model that encodes mass spectra into a continuous and chemically informative embedding space; in the second stage, we train a flow-matching decoder to reconstruct molecules from these latent embeddings of mass spectra. We present ablation studies demonstrating the importance of using information-preserving molecular descriptors for encoding mass spectra, and we motivate the use of our discrete flow-based decoder. Our rigorous evaluation demonstrates that MSFlow can accurately translate up to 45 percent of molecular mass spectra into their corresponding molecular representations, an improvement of up to fourteen-fold over the previous state of the art. A trained version of MSFlow is publicly available on GitHub for non-commercial users.
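The headline figure above is a fraction of spectra whose generated structure matches the reference structure. As a minimal sketch of how such an exact-match rate could be computed, the snippet below assumes that each spectrum comes with a ranked list of candidate structures and that structures are compared as already-canonicalized identifier strings. The function name `top_k_exact_match`, the choice of k, and the toy data are hypothetical illustrations and are not taken from MSFlow or its evaluation protocol.

```python
from typing import Sequence


def top_k_exact_match(
    predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int = 1,
) -> float:
    """Return the fraction of spectra whose reference structure appears
    among the first k ranked candidates.

    Assumption: all strings are canonical identifiers, so plain string
    equality is a valid test of structural identity.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(ref in cands[:k] for cands, ref in zip(predictions, references))
    return hits / len(references)


# Hypothetical toy data: four spectra, two ranked candidates each.
predictions = [["A", "B"], ["C", "D"], ["E", "F"], ["G", "H"]]
references = ["A", "D", "X", "G"]

top1 = top_k_exact_match(predictions, references, k=1)
top2 = top_k_exact_match(predictions, references, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In practice the string comparison would be replaced by whatever notion of structural identity the evaluation uses, since two different strings can describe the same molecule unless both have been canonicalized in the same way.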