Accurate prediction of nuclear magnetic resonance (NMR) chemical shifts is fundamental to spectral analysis and molecular structure elucidation, yet existing machine learning methods rely on small atom-assigned datasets that are labor-intensive to curate. We propose a semi-supervised framework that learns NMR chemical shifts from millions of literature-extracted spectra without explicit atom-level assignments by combining a small amount of labeled data with large-scale unassigned spectra. We formulate chemical shift prediction from literature spectra as a permutation-invariant set supervision problem and show that, under commonly satisfied conditions on the loss function, optimal bipartite matching reduces to a sorting-based loss, which enables stable semi-supervised training at a scale beyond traditional curated datasets. Our models are substantially more accurate and robust than state-of-the-art methods and generalize better on significantly larger and more diverse molecular datasets. Moreover, by incorporating solvent information at scale, our approach captures, for the first time, systematic solvent effects across common NMR solvents. Overall, our results demonstrate that large-scale unlabeled spectra mined from the literature are a practical and effective data source for training NMR shift models, suggesting a broader role for literature-derived, weakly structured data in data-centric AI for science.
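The reduction from optimal bipartite matching to a sorting-based loss can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not spell out: the loss is taken to be squared error (one convex choice for which sorted pairing of scalar values is known to be optimal), the predicted and observed sets are taken to have equal size, and the shift values and the function names `sorted_set_loss` and `brute_force_matching_loss` are made up for illustration. It is not the authors' implementation.

```python
from itertools import permutations


def sorted_set_loss(pred, target):
    """Mean squared error between the two lists after sorting each one.

    Hypothetical helper: assumes equal-length lists of scalar shifts (ppm).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pairs = zip(sorted(pred), sorted(target))
    return sum((p - t) ** 2 for p, t in pairs) / len(pred)


def brute_force_matching_loss(pred, target):
    """Minimum mean squared error over every one-to-one assignment.

    Enumerates all permutations, so it is only usable for tiny sets;
    it serves here as a reference for the optimal bipartite matching.
    """
    n = len(pred)
    return min(
        sum((p - t) ** 2 for p, t in zip(pred, perm)) / n
        for perm in permutations(target)
    )


# Made-up values: predictions in atom order, peaks as an unassigned list.
pred = [7.31, 1.18, 3.62, 2.05]
target = [2.10, 7.26, 1.25, 3.70]

loss_sorted = sorted_set_loss(pred, target)
loss_matched = brute_force_matching_loss(pred, target)

# The two losses agree up to floating-point rounding.
print(abs(loss_sorted - loss_matched) < 1e-12)
```

Sorting costs O(n log n) per spectrum and needs no assignment solver, and the sorted loss does not change when the peak list is reordered, which is the permutation invariance that unassigned literature spectra require.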