AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the Visual Reasoning Benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. The benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, covering tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark, which intentionally uses unedited, minimal-text images to test whether models can meet the realistic needs of primary education. Our findings reveal a ``jagged frontier'' of capability: models are more proficient in static skills such as counting and scaling, but reach a distinct ``spatial ceiling'' when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk when models are used in classrooms on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and the reinforcement of student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.