Human capabilities in understanding visual relations are far superior to those of AI systems, especially for previously unseen objects. For example, while AI systems struggle to determine whether two such objects are visually the same or different, humans can do so with ease. Active vision theories postulate that the learning of visual relations is grounded in actions that we take to fixate objects and their parts by moving our eyes. In particular, the low-dimensional spatial information about the corresponding eye movements is hypothesized to facilitate the representation of relations between different image parts. Inspired by these theories, we develop a system equipped with a novel Glimpse-based Active Perception (GAP) that sequentially glimpses at the most salient regions of the input image and processes them at high resolution. Importantly, our system leverages the locations stemming from the glimpsing actions, along with the visual content around them, to represent relations between different parts of the image. The results suggest that the GAP is essential for extracting visual relations that go beyond the immediate visual content. Our approach reaches state-of-the-art performance on several visual reasoning tasks while being more sample-efficient and generalizing better to out-of-distribution visual inputs than prior models.
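To make the glimpsing mechanism concrete, the sketch below (ours, not the paper's code) mimics a GAP-style loop in plain NumPy: a hand-crafted gradient-magnitude saliency map stands in for the model's learned saliency, `take_glimpses` sequentially fixates peak-saliency locations with inhibition of return and extracts a high-resolution patch at each, and `relation_code` tags each patch summary with its normalized fixation location to form simple pairwise relation vectors. All function names, the patch size, and the mean-pooled features are illustrative assumptions, not the paper's architecture.

```python
# A minimal sketch of a glimpse-based active-perception loop, assuming a
# hand-crafted saliency proxy (gradient magnitude) in place of the paper's
# learned saliency; all names and hyperparameters here are illustrative.
import numpy as np

def saliency_map(image):
    """Crude saliency proxy: gradient magnitude of a grayscale image."""
    gy, gx = np.gradient(image.astype(float))
    return np.hypot(gx, gy)

def take_glimpses(image, num_glimpses=4, patch=9):
    """Sequentially fixate the most salient point, extract a high-resolution
    patch around it, and suppress that region (inhibition of return)."""
    sal = saliency_map(image)
    half = patch // 2
    padded = np.pad(image, half, mode="edge")
    glimpses = []
    for _ in range(num_glimpses):
        y, x = np.unravel_index(np.argmax(sal), sal.shape)
        content = padded[y:y + patch, x:x + patch]           # visual content
        location = (x / image.shape[1], y / image.shape[0])  # glimpse action
        glimpses.append((location, content))
        sal[max(0, y - half):y + half + 1,
            max(0, x - half):x + half + 1] = -np.inf
    return glimpses

def relation_code(glimpses):
    """Tag each glimpse's content summary with its fixation location, then
    encode relations between parts as pairwise feature differences."""
    feats = [np.concatenate([[content.mean()], location])
             for location, content in glimpses]
    return [fi - fj for i, fi in enumerate(feats)
            for j, fj in enumerate(feats) if i < j]

rng = np.random.default_rng(0)
codes = relation_code(take_glimpses(rng.random((64, 64))))
print(len(codes), codes[0].shape)  # 6 pairwise relation vectors of size 3
```

The point the sketch preserves is the abstract's central claim: the low-dimensional fixation coordinates travel alongside the visual content, so relations between image parts can be read off location-tagged features rather than raw pixels alone.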