Visual scenes are composed of visual concepts, so the space of possible scenes explodes combinatorially. An important reason humans can learn efficiently from diverse visual scenes is their ability to perceive scenes compositionally, and it is desirable for artificial intelligence to have similar abilities. Compositional scene representation learning is a task that enables such abilities. In recent years, various methods have been proposed that apply deep neural networks, which have proven advantageous in representation learning, to learn compositional scene representations via reconstruction, bringing this research direction into the deep learning era. Learning via reconstruction is advantageous because it can exploit massive unlabeled data and avoid costly and laborious data annotation. In this survey, we first outline the current progress on reconstruction-based compositional scene representation learning with deep neural networks, including the development history and categorizations of existing methods from the perspectives of visual scene modeling and scene representation inference. We then provide benchmarks of representative methods that address the most extensively studied problem setting and form the foundation for other methods, together with an open-source toolbox for reproducing the benchmark experiments. Finally, we discuss the limitations of existing methods and future directions for this research topic.