InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation

Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.
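As a rough illustration of the hybrid framework described above, the following minimal sketch records the two kinds of evidence for each generated demonstration: outcomes of programmatic functional tests, and outcomes of checklist items judged against a reference snapshot. The schema (`ItemResult`), the function names, and the simple averaging are assumptions made for illustration only; they are not taken from the InteractScience codebase, and the benchmark's actual metrics may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ItemResult:
    """Outcome of evaluating one generated demonstration (hypothetical schema)."""
    domain: str                    # illustrative domain label, e.g. "physics"
    unit_tests_passed: List[bool]  # one entry per programmatic functional test
    checklist_met: List[bool]      # one entry per visual checklist item


def fraction_true(flags: List[bool]) -> float:
    """Return the share of True entries; an empty list scores 0.0."""
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


def summarize(results: List[ItemResult]) -> Dict[str, float]:
    """Average the functional and visual scores over all items.

    The two scores are reported separately rather than merged, since
    the abstract does not state how (or whether) they are combined.
    """
    n = len(results)
    if n == 0:
        return {"functional": 0.0, "visual": 0.0}
    return {
        "functional": sum(fraction_true(r.unit_tests_passed) for r in results) / n,
        "visual": sum(fraction_true(r.checklist_met) for r in results) / n,
    }


# Made-up example data, not benchmark results.
results = [
    ItemResult("physics", [True, True, False, True], [True, False]),
    ItemResult("chemistry", [True, True], [True, True]),
]
summary = summarize(results)
print(summary)
```

Keeping the functional and visual scores apart makes it easier to see whether a model fails on interaction logic, on rendered appearance, or on both.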
