Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

From arXiv. Preprint, under review. 41 pages. Project page: https://scicrafter-bench.github.io/. Code: https://github.com/scicrafter-bench/scicraft-bench

Discovering causal regularities and applying them to build functional systems (the discovery-to-application loop) is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we decompose the loop into four capacities (knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application) and design targeted interventions whose marginal contributions serve as proxies for the corresponding gaps. Our analysis reveals that although general knowledge application remains the biggest gap across all models, for frontier models knowledge gap identification is becoming a major hurdle, indicating that the bottleneck for current AI is shifting from solving problems correctly to raising the right problems. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.
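
The idea of reading an intervention's marginal contribution as a proxy for a capacity gap can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that interventions are stacked cumulatively in a fixed order and that the gain at each step is attributed to the corresponding capacity; the abstract does not specify the actual protocol. The function names and all success rates below are hypothetical placeholders, not results from the paper (only the 26% starting point echoes the plateau figure quoted above).

```python
from typing import Dict, List, Tuple

# The four capacities named in the abstract. Treating them as an ordered
# ladder of cumulative interventions is an assumption of this sketch.
CAPACITIES: List[str] = [
    "knowledge gap identification",
    "experimental discovery",
    "knowledge consolidation",
    "knowledge application",
]


def marginal_contributions(
    baseline: int, ladder: List[Tuple[str, int]]
) -> Dict[str, int]:
    """Return the success-rate gain (in percentage points) of each step.

    `baseline` is the success rate with no intervention. `ladder` lists
    (capacity, success rate) pairs, each measured with that capacity's
    intervention added on top of all previous ones.
    """
    gains: Dict[str, int] = {}
    previous = baseline
    for capacity, rate in ladder:
        gains[capacity] = rate - previous
        previous = rate
    return gains


def largest_gap(gains: Dict[str, int]) -> str:
    """Name the capacity whose intervention produced the largest gain."""
    return max(gains, key=lambda capacity: gains[capacity])


# Hypothetical success rates in percent (NOT taken from the paper).
example_ladder: List[Tuple[str, int]] = [
    (CAPACITIES[0], 38),
    (CAPACITIES[1], 45),
    (CAPACITIES[2], 52),
    (CAPACITIES[3], 90),
]

if __name__ == "__main__":
    gains = marginal_contributions(26, example_ladder)
    for capacity, gain in gains.items():
        print(f"{capacity}: +{gain} points")
    print("largest gap:", largest_gap(gains))
```

Under this cumulative design, a large gain at one step suggests that the unassisted agent was weak in the capacity that step's intervention supplies. Attribution can depend on the order in which interventions are stacked, so the ordering is itself a design choice.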
