Vision transformers (ViTs) are top-performing models on many computer vision benchmarks and can accurately predict human behavior on object recognition tasks. However, researchers question the value of using ViTs as models of biological learning because ViTs are thought to be more data-hungry than brains, requiring more training data to reach similar levels of performance. To test this assumption, we directly compared the learning abilities of ViTs and animals by performing parallel controlled-rearing experiments on ViTs and newborn chicks. We first raised chicks in impoverished visual environments containing a single object, then simulated the training data available in those environments by building virtual animal chambers in a video game engine. We recorded the first-person images acquired by agents moving through the virtual chambers and used those images to train self-supervised ViTs that leverage time as a teaching signal, akin to biological visual systems. When ViTs were trained through the eyes of newborn chicks, the ViTs solved the same view-invariant object recognition tasks as the chicks. Thus, ViTs were not more data-hungry than newborn visual systems: both learned view-invariant object representations in impoverished visual environments. The flexible and generic attention-based learning mechanism in ViTs, combined with the embodied data streams available to newborn animals, appears sufficient to drive the development of animal-like object recognition.
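To make "time as a teaching signal" concrete, the following is a minimal sketch of one way such an objective could look, assuming a contrastive (InfoNCE-style) loss in which frames that occur close together in time are treated as views of the same thing. The abstract does not state the exact objective used, so the loss form, the function names (`temporal_pairs`, `time_contrastive_loss`), and the `window` and `temperature` parameters are all illustrative assumptions rather than the authors' method. The embeddings stand in for the per-frame outputs of a ViT encoder.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def temporal_pairs(num_frames, window):
    """Index pairs (i, j), i < j, of frames at most `window` time steps apart."""
    return [
        (i, j)
        for i in range(num_frames)
        for j in range(i + 1, min(i + window, num_frames - 1) + 1)
    ]


def time_contrastive_loss(embeddings, window=1, temperature=0.1):
    """Mean InfoNCE-style loss with temporally nearby frames as positives.

    `embeddings` is a list of per-frame vectors in temporal order
    (hypothetical encoder outputs). All other frames in the sequence
    serve as negatives for a given anchor.
    """
    z = [normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i, j in temporal_pairs(n, window):
        # Scaled similarities from anchor frame i to every other frame.
        logits = [dot(z[i], z[k]) / temperature for k in range(n) if k != i]
        positive = dot(z[i], z[j]) / temperature
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - positive)
    return sum(losses) / len(losses)


# Toy sequence: two frames near one direction, then two near another.
smooth = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
# The same frames in an order that breaks temporal continuity.
shuffled = [smooth[0], smooth[2], smooth[1], smooth[3]]

smooth_loss = time_contrastive_loss(smooth)
shuffled_loss = time_contrastive_loss(shuffled)
```

In this toy case the loss is lower when consecutive frames have similar embeddings than when the same frames are shuffled in time, which is the sense in which temporal adjacency supplies a training signal without labels.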