arxiv:2601.15763

NMRGym: A Comprehensive Benchmark for Nuclear Magnetic Resonance Based Molecular Structure Elucidation

Published on Jan 22

Authors:

Abstract

NMRGym presents the largest standardized benchmark dataset for NMR spectroscopy, addressing domain shift issues through high-quality experimental data and rigorous evaluation protocols.

AI-generated summary

Nuclear Magnetic Resonance (NMR) spectroscopy is the cornerstone of small-molecule structure elucidation. While deep learning has demonstrated significant potential in automating structure elucidation and spectral simulation, current progress is severely impeded by the reliance on synthetic datasets, which introduces significant domain shifts when applied to real-world experimental spectra. Furthermore, the lack of standardized evaluation protocols and rigorous data splitting strategies frequently leads to unfair comparisons and data leakage. To address these challenges, we introduce NMRGym, the largest and most comprehensive standardized dataset and benchmark derived from high-quality experimental NMR data to date. Comprising 269,999 unique molecules paired with high-fidelity ^1H and ^{13}C spectra, NMRGym bridges the critical gap between synthetic approximations and real-world diversity. We implement a strict quality control pipeline and unify data formats to ensure fair comparison. To strictly prevent data leakage, we enforce a scaffold-based split. Additionally, we provide fine-grained peak-atom level annotations to support future usage. Leveraging this resource, we establish a comprehensive evaluation suite covering diverse downstream tasks, including structure elucidation, functional group prediction from NMR, toxicity prediction from NMR, and spectral simulation, benchmarking representative state-of-the-art methodologies. Finally, we release an open-source leadboard with an automated leaderboard to foster community collaboration and standardize future research. The dataset, benchmark and leaderboard are publicly available at blue{https://AIMS-Lab-HKUSTGZ.github.io/NMRGym/}.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.15763 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.15763 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.15763 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.