Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Notably, the visual system processes only the center of the visual field at high resolution and learns similar representations for visual inputs that occur close together in time, thereby emphasizing slowly changing information around gaze locations. This study investigates the roles of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze-prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive self-supervised learning model on them. Our results show that combining temporal slowness with central vision improves the encoding of several semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while exploiting temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insight into the mechanisms by which humans may develop semantic object representations from natural visual experience.
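The pipeline described above, gaze-centered cropping followed by time-contrastive learning, can be sketched in simplified form. Everything below is an illustrative assumption rather than the authors' actual implementation: `central_crop` and `time_contrastive_loss` are hypothetical helpers, the crop size and temperature are arbitrary, and a real system would use a learned gaze-prediction model and a deep encoder in place of raw pixel embeddings.

```python
import numpy as np

def central_crop(frame, gaze_xy, crop_size):
    """Extract a square crop centered on the predicted gaze location,
    clamped so the crop stays inside the frame (mimics central vision)."""
    h, w = frame.shape[:2]
    half = crop_size // 2
    cx = int(np.clip(gaze_xy[0], half, w - half))  # clamp gaze x to valid range
    cy = int(np.clip(gaze_xy[1], half, h - half))  # clamp gaze y to valid range
    return frame[cy - half:cy + half, cx - half:cx + half]

def time_contrastive_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss treating temporally adjacent crops as positives
    (the "slowness" objective). embeddings: (T, D) array ordered in time."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature              # (T, T) cosine-similarity logits
    losses = []
    for t in range(len(z) - 1):
        logits = np.delete(sim[t], t)        # drop self-similarity
        log_prob = logits - np.log(np.exp(logits).sum())
        # after deleting index t, the original index t+1 (the temporal
        # positive, i.e. the next crop in time) sits at position t
        losses.append(-log_prob[t])
    return float(np.mean(losses))
```

For instance, applying `central_crop` to a 100x100 frame with a 32-pixel crop always yields a 32x32 patch, even when the gaze point lies at the image border; minimizing `time_contrastive_loss` then pulls embeddings of successive gaze-centered crops together relative to all other time steps.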