Early in development, infants learn to extract surprisingly complex aspects of visual scenes. This early learning comes together with an initial understanding of the extracted concepts, such as their implications, causality, and their use in predicting likely future events. In many cases, this learning is achieved with little or no supervision and from relatively few examples compared to current network models. Empirical studies of visual perception in early development have shown that, in the domain of objects and human-object interactions, early-acquired concepts are often used in learning additional, more complex concepts. In the current work, we model how early-acquired concepts are used in the learning of subsequent concepts, and compare the results with standard deep network modeling. We focus in particular on the use of the concepts of animacy and goal attribution in learning to predict future events in dynamic visual scenes. We show that using early concepts when learning new concepts leads to better learning (higher accuracy) and more efficient learning (requiring less data), and that the combination of early and new concepts shapes the representation of the concepts acquired by the model and improves its generalization. We further compare advanced vision-language models with results from a human study on a task that requires an understanding of the behavior of animate vs. inanimate agents, with results supporting the contribution of early concepts to visual understanding. Finally, we briefly discuss the possible benefits of incorporating aspects of human-like visual learning into computer vision models.