Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs) such as ChatGPT and Gemini; some of these breakthroughs even appear to let AI models outperform humans on varied tasks that demand higher-order cognitive skills. Are current large AI models truly capable of generalized problem solving, as humans are? A systematic analysis of AI capabilities for joint vision-and-text reasoning, however, is missing from the current scientific literature. In this paper, we make an effort toward filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition targeted at children in grades 1-12, which tests children's deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from the years 2020-2024. With our dataset, we analyze LVLMs' power on mathematical reasoning; their responses to our puzzles offer a direct way to compare against the performance of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that their capabilities appear to rest on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.