Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra

Tandem mass spectrometry (MS/MS) is a cornerstone technique for identifying unknown small molecules in fields such as metabolomics, natural product discovery, and environmental analysis. However, the probabilistic nature of fragmentation and the sheer size of chemical space make structure elucidation from such spectra highly challenging, particularly when deployment conditions differ from training conditions. Current methods rely on matching against databases of previously observed spectra of known molecules and on multi-step pipelines that require intermediate fingerprint prediction or expensive fragment annotations. We introduce a novel end-to-end framework based on a transformer model that directly generates molecular structures from an input tandem mass spectrum and its corresponding molecular formula, eliminating the need for manual annotations and intermediate steps while leveraging transfer learning from simulated data. To further address the challenge of out-of-distribution spectra, we introduce a test-time tuning strategy that dynamically adapts the pre-trained model to novel experimental data. Our approach achieves a Top-1 accuracy of 3.16% on the MassSpecGym benchmark and 12.88% on the NPLIB1 dataset, considerably outperforming conventional fine-tuning and surpassing baseline approaches by 27% and 67%, respectively. Even when the exact reference structure is not recovered, the generated candidates are chemically informative, exhibiting high structural plausibility as reflected by strong Tanimoto similarity to the ground truth. Notably, we observe a relative improvement in average Tanimoto similarity of 83% on NPLIB1 and 64% on MassSpecGym compared to state-of-the-art methods. Our framework combines simplicity with adaptability, generating accurate molecular candidates that offer valuable guidance for expert interpretation of unseen spectra.
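
For readers unfamiliar with the metric, the Tanimoto similarity mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions: fingerprints are represented as sets of "on" bit indices, the function name `tanimoto` and the toy fingerprints are hypothetical, and the abstract does not specify which fingerprint type the authors use.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Assumed convention for this sketch: two empty fingerprints count as identical.
        return 1.0
    shared = len(fp_a & fp_b)  # bits set in both fingerprints
    return shared / (len(fp_a) + len(fp_b) - shared)  # shared bits / bits set in either


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
predicted = {1, 4, 7, 9}
reference = {1, 4, 8, 9, 12}
similarity = tanimoto(predicted, reference)
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits, so a candidate with high similarity to the reference can still be informative even when it is not an exact match.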
