AI co-scientists are increasingly used for scientific discovery, but current evaluations do not test them on a key task: moving from a concrete scientific or technological problem to a plausible, mechanism-grounded solution hypothesis. This gap matters especially in materials science, and in battery research in particular, where a useful proposal must identify the relevant failure mode, propose a credible intervention, and explain why that intervention should improve the target property. We introduce Matter to Mechanism, a benchmark for evaluating AI co-scientists on problem-to-hypothesis reasoning in materials science, with a focus on battery materials research. The benchmark contains 2,645 instances derived from scientific publications. Each instance includes a structured problem statement, a candidate solution hypothesis, an explicit reasoning trace, and domain-grounded annotations: material system, component, failure mode, intervention, mechanism, target property, and claimed outcome. We also introduce a metric suite that measures reasoning fidelity, problem alignment, mechanistic specificity, novelty, plausibility, and problem decomposition quality, and that combines these dimensions into a composite score. Using this framework, we evaluate several AI co-scientist systems and show that Matter to Mechanism reveals interpretable differences between systems that standard text-similarity metrics recover only partially. We further show through adversarial stress tests that the composite score is more stable than the individual metric dimensions under superficial gaming attacks.
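The instance structure and composite score described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual schema or scoring rule: the class names, field names, and metric keys are hypothetical renderings of the items listed in the abstract, and the unweighted mean is an assumption, since the abstract does not say how the six dimensions are combined.

```python
from dataclasses import dataclass
from statistics import fmean

# The six metric dimensions named in the abstract.
# The snake_case keys are illustrative, not the benchmark's identifiers.
METRIC_DIMENSIONS = (
    "reasoning_fidelity",
    "problem_alignment",
    "mechanistic_specificity",
    "novelty",
    "plausibility",
    "decomposition_quality",
)


@dataclass
class DomainAnnotations:
    """Domain-grounded annotation fields listed in the abstract."""
    material_system: str
    component: str
    failure_mode: str
    intervention: str
    mechanism: str
    target_property: str
    claimed_outcome: str


@dataclass
class BenchmarkInstance:
    """One problem-to-hypothesis instance (hypothetical layout)."""
    problem_statement: str
    solution_hypothesis: str
    reasoning_trace: list[str]
    annotations: DomainAnnotations


def composite_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores into one number.

    ASSUMPTION: an unweighted mean. The source does not specify the
    aggregation, so this only stands in for whatever rule is used.
    """
    missing = [d for d in METRIC_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing metric dimensions: {missing}")
    return fmean([scores[d] for d in METRIC_DIMENSIONS])
```

Under this assumed aggregation, a system that inflates a single dimension moves the composite by only one sixth of that change, which is one simple way an aggregate can be less sensitive to gaming than any individual dimension.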