The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50\% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.