Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect the alignment between the representations learned by neural networks and human mental representations inferred from behavioral responses. We find that model scale and architecture have essentially no effect on the alignment with human behavioral responses, whereas the training dataset and objective function both have a much larger impact. These findings are consistent across three datasets of human similarity judgments collected using two different tasks. Linear transformations of neural network representations learned from behavioral responses from one dataset substantially improve alignment with human similarity judgments on the other two datasets. In addition, we find that some human concepts such as food and animals are well-represented by neural networks whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
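As a rough illustration of what "alignment with human similarity judgments" can mean in practice, the sketch below scores a set of model embeddings against triplet odd-one-out choices. The abstract does not name its two tasks or its similarity measure, so the odd-one-out setup, the cosine similarity, the function name `odd_one_out_accuracy`, and the optional `transform` matrix are all assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np


def odd_one_out_accuracy(embeddings, triplets, transform=None):
    """Fraction of triplets where the model picks the same odd one out as humans.

    Assumed conventions (hypothetical, not taken from the paper):
    - embeddings: array of shape (n_items, dim), one row per image.
    - triplets: integer array of shape (n_triplets, 3); the item that human
      participants chose as the odd one out is stored in the last column.
    - transform: optional (dim, dim_out) matrix applied to the embeddings
      first, standing in for a learned linear transformation.
    """
    x = np.asarray(embeddings, dtype=float)
    if transform is not None:
        x = x @ np.asarray(transform, dtype=float)

    # Cosine similarity between every pair of items.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T

    t = np.asarray(triplets)
    i, j, k = t[:, 0], t[:, 1], t[:, 2]

    # Column c holds the similarity of the pair that excludes position c,
    # so the argmax is the position the model treats as the odd one out.
    pair_sims = np.stack([sim[j, k], sim[i, k], sim[i, j]], axis=1)
    model_choice = pair_sims.argmax(axis=1)

    return float((model_choice == 2).mean())


# Toy data: items 0 and 1 point one way, items 2 and 3 another.
toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_triplets = np.array([[0, 1, 2], [2, 3, 0], [0, 2, 1]])
toy_accuracy = odd_one_out_accuracy(toy_embeddings, toy_triplets)
```

On the toy data the model agrees with the recorded choice on the first two triplets and disagrees on the third, where the recorded odd one out (item 1) is in fact closest to item 0. Passing a matrix as `transform` lets the same function compare raw and linearly transformed representations.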