Recent years have seen a growing number of applications of deep neural networks to tasks that require advanced cognitive abilities, e.g., playing Go, generating art, and powering chatbots such as ChatGPT. Such dramatic progress raises the question: how well do neural networks generalize when solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset for training deep neural networks, we programmatically generate entirely new instances of each puzzle while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision-and-language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, their accuracy is no better than random when they are evaluated for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.