Recent years have seen a growing number of applications of deep neural networks to tasks that require superior cognitive abilities, e.g., playing Go, generating art, and powering conversational systems such as ChatGPT. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and solving it requires a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset for training deep neural networks, we programmatically generate entirely new instances of each puzzle while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision-and-language meta-learning model that uses varied state-of-the-art backbones. Our experiments reveal that while powerful deep models achieve reasonable performance on puzzles in a supervised setting, their accuracy is no better than random when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, their answers are often incorrect.