Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2{,}000$ molecules demonstrates a high description precision of $98.6\%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.