Active Gaze Behavior Boosts Self-Supervised Object Learning

Because the projection of an object varies substantially across viewpoints, machine learning algorithms struggle to recognize the same object from different perspectives. In contrast, toddlers quickly learn to recognize objects from different viewpoints with almost no supervision. Recent work argues that toddlers develop this ability by mapping close-in-time visual inputs to similar representations while interacting with objects. High-acuity vision is available only in the central visual field, which may explain why toddlers (much like adults) constantly move their gaze during such interactions. It is unclear whether, and to what extent, toddlers curate their visual experience through these eye movements to support the learning of object representations. In this work, we explore whether a bio-inspired visual learning model can harness toddlers' gaze behavior during a play session to develop view-invariant object recognition. Using head-mounted eye tracking recorded during dyadic play, we simulate toddlers' central visual field experience by cropping image regions centered on the gaze location. This visual stream feeds a time-based self-supervised learning algorithm. Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations. Our analysis also reveals that the limited size of the high-acuity central visual field is crucial for this. We further find that toddlers' visual experience elicits more robust representations than adults', mostly because toddlers look at objects they hold themselves for longer bouts. Overall, our work reveals how toddlers' gaze behavior supports self-supervised learning of view-invariant object recognition.
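
The two steps described above, gaze-centered cropping followed by time-based self-supervised learning, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `gaze_crop` and `temporal_info_nce`, the crop size, the temperature, and the choice of an InfoNCE-style contrastive objective are all assumptions made for illustration, since the abstract does not specify the learning algorithm. The sketch only shows the general idea of pulling together the representations of crops that are close in time.

```python
import numpy as np


def gaze_crop(frame, gaze_xy, size):
    """Return a size x size crop of `frame` centered on the gaze point.

    frame:   H x W x C image array (hypothetical egocentric video frame).
    gaze_xy: (x, y) gaze location in pixel coordinates.
    size:    side length of the simulated central visual field (assumed value).
    The window is shifted where needed so it stays inside the frame.
    """
    h, w = frame.shape[:2]
    if size > min(h, w):
        raise ValueError("crop size exceeds frame dimensions")
    x, y = gaze_xy
    left = int(round(x)) - size // 2
    top = int(round(y)) - size // 2
    left = min(max(left, 0), w - size)
    top = min(max(top, 0), h - size)
    return frame[top:top + size, left:left + size]


def temporal_info_nce(z_t, z_next, temperature=0.1):
    """InfoNCE-style loss over temporally adjacent embeddings (an assumed objective).

    Row i of z_t and row i of z_next are embeddings of crops that are close
    in time (the positive pair); all other rows of z_next act as negatives.
    """
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    z_next = z_next / np.linalg.norm(z_next, axis=1, keepdims=True)
    logits = z_t @ z_next.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((480, 640, 3))               # stand-in for a video frame
    crop = gaze_crop(frame, gaze_xy=(320.0, 240.0), size=128)
    print(crop.shape)

    z = rng.normal(size=(8, 16))                    # stand-in for encoder outputs
    print(temporal_info_nce(z, z + 0.01 * rng.normal(size=z.shape)))
```

In a full pipeline, an image encoder would map each gaze-centered crop to an embedding before the loss is computed; the random vectors here only stand in for those outputs. Varying `size` is one way to probe the role of the extent of the central visual field that the analysis highlights.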
