AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature

This paper presents AgentCAT, a large language model (LLM) agent that extracts and analyzes catalytic reaction data from chemical engineering papers and supports natural-language interactive analysis of the extracted data. AgentCAT offers a way to ease the long-standing data bottleneck in the chemical engineering field, and its natural-language interface makes the extracted data accessible to the community. The paper also presents a formal abstraction and challenge analysis of the catalytic reaction data extraction task, framed for an artificial intelligence audience. This abstraction should help the artificial intelligence community understand the problem and, in turn, attract more attention to it. Technically, the complexity of catalytic processes leads to a complicated dependency structure in catalytic reaction data with respect to elementary reaction steps, molecular behaviors, measurement evidence, etc. This dependency structure makes it challenging to guarantee the correctness and completeness of the extracted data and to represent the data for analysis. AgentCAT addresses this challenge with four technical contributions: (1) a schema-governed extraction pipeline with progressive schema evolution, enabling robust data extraction from chemical engineering papers; (2) a dependency-aware reaction-network knowledge graph that links catalysts/active sites, synthesis-derived descriptors, mechanistic claims with evidence, and macroscopic outcomes, preserving process coupling and traceability; (3) a general querying module that supports natural-language exploration and visualization over the constructed graph for cross-paper analysis; (4) an evaluation on $\sim$800 peer-reviewed chemical engineering publications demonstrating the effectiveness of AgentCAT.
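To make contribution (2) more concrete, the following is a minimal illustrative sketch of how a dependency-aware graph with per-node provenance might be represented. It is not AgentCAT's actual implementation: the class names (`Node`, `ReactionGraph`), the node kinds, the relation labels, and the example records are all hypothetical and chosen only to mirror the entities named in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical node kinds, mirroring the entities named in the abstract.
NODE_KINDS = {"catalyst", "descriptor", "mechanistic_claim", "evidence", "outcome"}


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: str
    label: str
    source_paper: str  # provenance, so every extracted fact stays traceable


@dataclass
class ReactionGraph:
    nodes: dict = field(default_factory=dict)
    # edges[src] holds a list of (relation, dst) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node: Node) -> None:
        # Schema check: only known node kinds are accepted.
        if node.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {node.kind}")
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Reject dangling edges so the dependency structure stays complete.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges[src].append((relation, dst))

    def downstream(self, start: str) -> set:
        """Return the ids of all nodes reachable from `start`."""
        seen, stack = set(), [start]
        while stack:
            current = stack.pop()
            for _, dst in self.edges.get(current, []):
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen


# Placeholder records (not real extracted data).
graph = ReactionGraph()
graph.add_node(Node("cat1", "catalyst", "example catalyst", "paper_A"))
graph.add_node(Node("desc1", "descriptor", "example descriptor", "paper_A"))
graph.add_node(Node("claim1", "mechanistic_claim", "example claim", "paper_A"))
graph.add_node(Node("ev1", "evidence", "example measurement", "paper_A"))
graph.add_node(Node("out1", "outcome", "example outcome", "paper_A"))
graph.add_edge("cat1", "has_descriptor", "desc1")
graph.add_edge("cat1", "participates_in", "claim1")
graph.add_edge("claim1", "supported_by", "ev1")
graph.add_edge("claim1", "explains", "out1")

reachable = graph.downstream("cat1")
```

In this sketch, `graph.downstream("cat1")` collects every descriptor, claim, piece of evidence, and outcome that depends on the catalyst, and each node's `source_paper` field records where it came from. Rejecting unknown kinds and dangling edges is one simple way to keep such a graph consistent with a schema; the paper's own schema and its evolution mechanism may differ.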
