SketchJudge: A Diagnostic Benchmark for Grading Hand-drawn Diagrams with Multimodal Large Language Models

While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual understanding, they often struggle when faced with the unstructured and ambiguous nature of human-generated sketches. This limitation is particularly pronounced in the underexplored task of visual grading, where models should not only solve a problem but also diagnose errors in hand-drawn diagrams. Such diagnostic capabilities depend on complex structural, semantic, and metacognitive reasoning. To bridge this gap, we introduce SketchJudge, a novel benchmark tailored for evaluating MLLMs as graders of hand-drawn STEM diagrams. SketchJudge encompasses 1,015 hand-drawn student responses across four domains: geometry, physics, charts, and flowcharts, featuring diverse stylistic variations and distinct error types. Evaluations on SketchJudge demonstrate that even advanced MLLMs lag significantly behind humans, validating the benchmark's effectiveness in exposing the fragility of current vision-language alignment in symbolic and noisy contexts. All data, code, and evaluation scripts are publicly available at https://github.com/yuhangsu82/SketchJudge.
