Automated molecular structure elucidation remains challenging, as existing approaches often depend on pre-compiled databases or restrict themselves to a single spectroscopic modality. Here we introduce SpectraLLM, a large language model that performs end-to-end structure prediction by reasoning over one or more spectra. Unlike conventional spectrum-to-structure pipelines, SpectraLLM represents both continuous (IR, Raman, UV-Vis, NMR) and discrete (MS) modalities in a shared language space, enabling it to capture substructural patterns that are complementary across spectral types. We pretrain and fine-tune the model on small-molecule domains and evaluate it on four public benchmark datasets. SpectraLLM achieves state-of-the-art performance, substantially surpassing single-modality baselines. It is also robust in unimodal settings, and its prediction accuracy improves further when it reasons jointly over diverse spectra, establishing a scalable paradigm for language-based spectroscopic analysis. Code is available at https://github.com/OPilgrim/SpectraLLM.