ScratchEval: A Multimodal Evaluation Framework for LLMs in Block-Based Programming

LLMs have achieved strong performance on text-based programming tasks, yet they remain unreliable for block-based languages such as Scratch. Scratch programs exhibit deeply nested, non-linear structures, event-driven concurrency across multiple sprites, and tight coupling between code and multimedia assets, properties that differ fundamentally from textual code. As a result, LLMs often misinterpret Scratch semantics and generate large, invasive edits that are syntactically valid but semantically incorrect when repairing buggy programs. We introduce ScratchEval, the first executable benchmark designed to evaluate LLM-based repair for Scratch programs, covering program understanding, debugging, analysis, and repair. The benchmark contains 100 curated Scratch projects from the public repository, selected for structural and semantic complexity. Each project is paired with executable test suites, bug descriptions with corresponding fixes, block-level edit constraints defining minimal semantically correct repairs, and required multimedia assets. The benchmark is constructed through a human-in-the-loop pipeline combining automated project mining with expert validation of trigger-outcome semantics and representative bug patterns, with emphasis on event ordering, concurrency, and state management. To enable rigorous and reproducible evaluation, we propose a three-layer executable protocol measuring functional correctness via VM-level execution, repair quality using block-level edit distance and behavioral trajectory comparisons, and explanation quality via structured rubrics assessing alignment between model reasoning and generated patches. Using ScratchEval, we study domain-specific fine-tuning, training data effectiveness, and model generalization to unseen bug types. ScratchEval provides a reproducible foundation for evaluating and post-training LLMs on block-based programming tasks.
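
To make the repair-quality layer more concrete, the following is a minimal sketch of one way a block-level edit distance could be computed. It is not the benchmark's actual metric, whose definition the abstract does not give. The sketch assumes scripts are stored as in a Scratch 3 `project.json` block map (block id mapped to a record with `opcode` and `next`), flattens a single script by following `next` pointers, and applies plain Levenshtein distance to the opcode sequences. The helper names `flatten_script` and `edit_distance` and the toy projects are hypothetical, and nested inputs and fields are deliberately ignored.

```python
from typing import Dict, List, Optional


def flatten_script(blocks: Dict[str, dict], top_id: str) -> List[str]:
    """Collect opcodes of one script by following `next` pointers.

    Assumption: `blocks` maps block ids to records with "opcode" and
    "next" keys. Nested inputs (e.g. loop bodies) are not traversed.
    """
    opcodes: List[str] = []
    current: Optional[str] = top_id
    while current is not None:
        block = blocks[current]
        opcodes.append(block["opcode"])
        current = block.get("next")
    return opcodes


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance over opcode sequences (each edit costs 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy data: the patch inserts a single wait block.
buggy = {
    "a": {"opcode": "event_whenflagclicked", "next": "b"},
    "b": {"opcode": "motion_movesteps", "next": None},
}
patched = {
    "a": {"opcode": "event_whenflagclicked", "next": "b"},
    "b": {"opcode": "motion_movesteps", "next": "c"},
    "c": {"opcode": "control_wait", "next": None},
}

distance = edit_distance(flatten_script(buggy, "a"), flatten_script(patched, "a"))
```

Under this simplification, a small distance indicates a minimally invasive patch, while a large distance flags the kind of sweeping rewrite the abstract describes. A tree-aware distance would be needed to account for nested control blocks.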
