Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

AI co-scientists are increasingly used for scientific discovery, but current evaluations still do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap is especially important in materials science and, in particular, battery research, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why that intervention should improve the target property. We introduce Matter to Mechanism, a benchmark for evaluating AI co-scientists on problem-to-hypothesis reasoning in materials science, with a focus on battery materials research. The benchmark contains 2,645 instances derived from scientific publications. Each instance includes a structured problem statement, a candidate solution hypothesis, an explicit reasoning trace, and domain-grounded annotations such as material system, component, failure mode, intervention, mechanism, target property, and claimed outcome. We also introduce a metric suite that measures reasoning fidelity, problem alignment, mechanistic specificity, novelty, plausibility, and problem decomposition quality, and combine them into a composite score. Using this framework, we evaluate several AI co-scientist systems and show that Matter to Mechanism reveals interpretable system differences that are only partially recovered by standard text-similarity metrics. We further show through adversarial stress tests that the aggregate score is more stable than individual metric dimensions under superficial gaming attacks.
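To make the instance layout and the composite score described above concrete, here is a minimal Python sketch. The field names mirror the annotation types the abstract lists, but the released schema is not given here, so the exact names and types are assumptions. The abstract also does not say how the six metric dimensions are combined; the unweighted mean over scores in [0, 1] used below is a hypothetical stand-in, not the authors' formula.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import fmean


@dataclass
class Instance:
    """One benchmark instance. Field names are assumed, based on the
    annotation types named in the abstract, not on a published schema."""
    problem_statement: str
    solution_hypothesis: str
    reasoning_trace: list[str]
    material_system: str
    component: str
    failure_mode: str
    intervention: str
    mechanism: str
    target_property: str
    claimed_outcome: str


# The six metric dimensions named in the abstract (identifier spelling assumed).
METRIC_DIMENSIONS = (
    "reasoning_fidelity",
    "problem_alignment",
    "mechanistic_specificity",
    "novelty",
    "plausibility",
    "decomposition_quality",
)


def composite_score(scores: dict[str, float]) -> float:
    """Hypothetical aggregate: unweighted mean of the six dimensions.

    Assumes every dimension is scored in [0, 1]; the abstract does not
    specify the scale or the weighting.
    """
    missing = [d for d in METRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing metric dimensions: {missing}")
    for d in METRIC_DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} must lie in [0, 1], got {scores[d]}")
    return fmean(scores[d] for d in METRIC_DIMENSIONS)
```

Calling `composite_score` with a dictionary that holds one value per dimension returns a single number in [0, 1]; a missing or out-of-range dimension raises `ValueError` rather than being silently skipped.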

