Are Deep Neural Networks SMARTer than Second Graders?

Recent years have seen deep neural networks applied to a growing number of tasks that require advanced cognitive abilities, e.g., playing Go, generating art, and conversing (ChatGPT). Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset for training deep neural networks, we programmatically generate entirely new instances of each puzzle while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, their accuracy is no better than random when they are analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that although these models show convincing reasoning abilities, their answers are often incorrect.
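The idea of generating new instances of a puzzle while keeping its solution algorithm fixed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy text-only subtraction puzzle, the names `PuzzleInstance`, `solve`, and `generate_instance`, and the choice of five answer options are hypothetical and are not taken from the SMART-101 generator, whose puzzles also include a picture.

```python
import random
from dataclasses import dataclass


@dataclass
class PuzzleInstance:
    question: str
    options: list[int]
    answer_index: int


def solve(total: int, taken: int) -> int:
    """Fixed solution rule shared by every instance of this toy puzzle."""
    return total - taken


def generate_instance(rng: random.Random, num_options: int = 5) -> PuzzleInstance:
    # Sample fresh parameters; only the numbers change, not the solving rule.
    total = rng.randint(5, 20)
    taken = rng.randint(1, total - 1)
    answer = solve(total, taken)

    # Build distinct wrong options close to the correct answer (hypothetical scheme).
    candidates = [v for v in range(max(0, answer - 5), answer + 6) if v != answer]
    options = rng.sample(candidates, num_options - 1) + [answer]
    rng.shuffle(options)

    question = f"A basket holds {total} apples. {taken} are eaten. How many are left?"
    return PuzzleInstance(question, options, options.index(answer))


rng = random.Random(0)  # seeded so the generated instances are reproducible
for inst in (generate_instance(rng) for _ in range(3)):
    print(inst.question, inst.options, "->", inst.options[inst.answer_index])
```

With five options per instance, a model that guesses uniformly would be right about 20% of the time; that is the kind of random baseline against which generalization accuracy is compared, under this sketch's assumption about the number of options.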

