Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs) such as ChatGPT and Gemini; some of these breakthroughs even appear to let AI models outperform humans on varied tasks that demand higher-order cognitive skills. Are current large AI models truly capable of generalized problem solving, as humans are? A systematic analysis of AI capabilities for joint vision-and-text reasoning, however, is missing from the current scientific literature. In this paper, we make an effort toward filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition targeted at children in grades 1-12, which tests children's deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from the years 2020-2024. With our dataset, we analyze LVLMs' power on mathematical reasoning; their responses to our puzzles offer a direct way to compare against the performance of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that their capabilities appear to rest on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.