Humans effortlessly infer the 3D shape of objects. What computations underlie this ability? Although various computational models have been proposed, none captures the human ability to match object shape across viewpoints. Here, we ask whether and how this gap might be closed. We begin with a relatively novel class of computational models, 3D neural fields, which encapsulate the basic principles of classic analysis-by-synthesis in a deep neural network (DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D matching judgments that are well aligned with human judgments in three settings: within-category comparisons, adversarially defined comparisons that accentuate the 3D failure cases of standard DNN models, and adversarially defined comparisons for algorithmically generated shapes with no category structure. We then investigate the source of the 3D-LFN's human-aligned performance through a series of computational experiments. Exposure to multiple viewpoints of objects during training and a multi-view learning objective are the primary factors behind model-human alignment; even conventional DNN architectures come much closer to human behavior when trained with multi-view objectives. Finally, we find that although models trained with multi-view learning objectives partially generalize to new object categories, they fall short of full alignment with human judgments. This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures.