NMRGym: A Comprehensive Benchmark for Nuclear Magnetic Resonance Based Molecular Structure Elucidation

Nuclear Magnetic Resonance (NMR) spectroscopy is the cornerstone of small-molecule structure elucidation. Deep learning has shown considerable potential for automating structure elucidation and spectral simulation, but progress is impeded by a reliance on synthetic datasets, which introduces substantial domain shift when models are applied to real-world experimental spectra. Furthermore, the lack of standardized evaluation protocols and rigorous data-splitting strategies frequently leads to unfair comparisons and data leakage. To address these challenges, we introduce \textbf{NMRGym}, the largest and most comprehensive standardized dataset and benchmark derived from high-quality experimental NMR data to date. Comprising \textbf{269,999} unique molecules paired with high-fidelity $^1$H and $^{13}$C spectra, NMRGym bridges the gap between synthetic approximations and real-world diversity. We implement a strict quality-control pipeline and unify data formats to ensure fair comparison. To prevent data leakage, we enforce a scaffold-based split. Additionally, we provide fine-grained peak-to-atom annotations to support future use. Leveraging this resource, we establish a comprehensive evaluation suite covering diverse downstream tasks, including structure elucidation, functional group prediction from NMR, toxicity prediction from NMR, and spectral simulation, and we benchmark representative state-of-the-art methods on each. Finally, we release an open-source, automated leaderboard to foster community collaboration and standardize future research. The dataset, benchmark, and leaderboard are publicly available at \textcolor{blue}{https://AIMS-Lab-HKUSTGZ.github.io/NMRGym/}.
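
A scaffold-based split is commonly implemented by grouping molecules under a shared scaffold key and assigning each whole group to a single subset, so that no scaffold appears in more than one of train, validation, and test. The following is a minimal sketch of that general idea, not the NMRGym pipeline itself: the function name `scaffold_split`, the split fractions, and the greedy largest-group-first assignment are all assumptions made for illustration. The scaffold keys are assumed to be precomputed strings, for example Bemis-Murcko scaffold SMILES such as those produced by RDKit's `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from collections import defaultdict


def scaffold_split(scaffold_keys, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold key spans two subsets.

    scaffold_keys: one precomputed scaffold string per molecule (assumed input).
    Returns three lists of indices: train, valid, test.
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for idx, key in enumerate(scaffold_keys):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index to stay deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffold_keys)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        # Each group goes to exactly one subset, never divided.
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


if __name__ == "__main__":
    # Hypothetical scaffold keys for ten molecules (illustrative only).
    keys = ["c1ccccc1"] * 4 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2 + [""]
    train, valid, test = scaffold_split(keys)
    print(len(train), len(valid), len(test))
```

Because groups are never divided, the realized subset sizes only approximate the requested fractions. In exchange, every test molecule has a scaffold that was absent from the training set.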
