Are Deep Neural Networks SMARTer than Second Graders?

Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, and ChatGPT. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6 to 8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances of each puzzle while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, their accuracy is no better than random when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a part of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.
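The idea of generating new instances of a puzzle while keeping its solution algorithm fixed can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the SMART-101 generator: the toy subtraction puzzle, the names `PuzzleInstance`, `generate_instance`, and `chance_accuracy`, and the choice of five answer options are all assumptions made for illustration, and the picture half of each puzzle is not modeled.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PuzzleInstance:
    question: str
    options: List[int]
    answer_index: int


def generate_instance(rng: random.Random) -> PuzzleInstance:
    """Hypothetical toy puzzle: how many apples remain after giving some away.

    Each call draws new numbers, but the solution algorithm (one
    subtraction) is the same for every instance.
    """
    total = rng.randint(5, 12)
    given = rng.randint(1, total - 1)
    answer = total - given
    # Assumption: five options per puzzle, i.e. the answer plus four
    # distinct distractors.
    distractors = rng.sample([v for v in range(13) if v != answer], 4)
    options = distractors + [answer]
    rng.shuffle(options)
    return PuzzleInstance(
        question=f"Mia has {total} apples and gives away {given} of them. "
                 "How many are left?",
        options=options,
        answer_index=options.index(answer),
    )


def chance_accuracy(instances: List[PuzzleInstance]) -> float:
    """Expected accuracy of guessing uniformly at random among the options."""
    return sum(1 / len(inst.options) for inst in instances) / len(instances)


rng = random.Random(0)
instances = [generate_instance(rng) for _ in range(1000)]
print(chance_accuracy(instances))
```

A baseline of this kind is what a "no better than random" result is measured against: with a fixed number of options per puzzle, a model that has learned nothing transferable should score close to the value printed here.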

