As part of human core knowledge, the representation of objects is the building block of mental representation that supports high-level concepts and symbolic reasoning. While humans develop the ability to perceive objects situated in 3D environments without supervision, models that learn the same set of abilities under constraints similar to those faced by human infants are lacking. Towards this end, we developed a novel network architecture that simultaneously learns to 1) segment objects from discrete images, 2) infer their 3D locations, and 3) perceive depth, all while using only information directly available to the brain as training data, namely sequences of images and self-motion. The core idea is to treat objects as latent causes of visual input that the brain uses to make efficient predictions of future scenes. As a result, object representations are learned as an essential byproduct of learning to predict.