Today's computer vision models achieve human-level or near-human-level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations, learned from the behavioral responses in one dataset, substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts, such as food and animals, are well represented by neural networks, whereas others, such as royal or sports-related objects, are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
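To make the notion of alignment concrete, the following is a minimal sketch of one way a score of this kind could be computed. It is not the evaluation pipeline used in the paper. It assumes that human judgments are available as a symmetric item-by-item similarity matrix, that model similarity is the cosine similarity between feature vectors, and that alignment is summarized as a rank correlation over item pairs. The function names (`cosine_similarity_matrix`, `alignment_score`) and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of `features`."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def rank(values):
    """Ranks of a 1-D array (ties broken by position, adequate for a sketch)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def alignment_score(features, human_similarity):
    """Rank correlation between model and human pairwise similarities.

    Only the upper triangle is used, so each unordered pair counts once.
    """
    model_similarity = cosine_similarity_matrix(features)
    upper = np.triu_indices(len(features), k=1)
    return np.corrcoef(rank(model_similarity[upper]),
                       rank(human_similarity[upper]))[0, 1]


if __name__ == "__main__":
    # Synthetic stand-in for real data (assumption): "human" similarities
    # are generated from a hidden linear transformation of the features.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(20, 8))        # 20 items, 8-dim model features
    hidden_transform = rng.normal(size=(8, 8))
    human_similarity = cosine_similarity_matrix(features @ hidden_transform)

    print("raw features:        ", alignment_score(features, human_similarity))
    print("transformed features:", alignment_score(features @ hidden_transform,
                                                   human_similarity))
```

In this toy setting the transformed features reproduce the synthetic "human" similarities exactly, while the raw features do not. This only illustrates why a linear transformation of the representation can change an alignment score; how such a transformation is actually learned from behavioral responses is not specified here.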