A molecule's properties are fundamentally determined by its composition and structure, as encoded in its molecular graph. Reasoning about molecular properties therefore requires the ability to parse and understand that graph. Large Language Models (LLMs) are increasingly applied to chemistry, tackling tasks such as molecular name conversion, captioning, text-guided generation, and property or reaction prediction. However, most existing benchmarks emphasize general chemical knowledge, rely on literature or surrogate labels that risk leakage or bias, or reduce evaluation to multiple-choice questions. We introduce MolecularIQ, a molecular structure reasoning benchmark focused exclusively on symbolically verifiable tasks. MolecularIQ enables fine-grained evaluation of reasoning over molecular graphs and reveals capability patterns that localize model failures to specific tasks and molecular structures. This provides actionable insights into the strengths and limitations of current chemistry LLMs and guides the development of models that reason faithfully over molecular structure.