We investigate whether end-to-end learning of visual reasoning can be achieved with general-purpose neural networks, with the help of visual pretraining. A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization in visual reasoning, and would confirm the feasibility of a neural network "generalist" that solves both visual recognition and reasoning tasks. We propose a simple and general self-supervised framework that "compresses" each video frame into a small set of tokens with a transformer network and reconstructs the remaining frames from the compressed temporal context. To minimize the reconstruction loss, the network must learn a compact representation of each image and must also capture temporal dynamics and object permanence from the temporal context. We evaluate on two visual reasoning benchmarks, CATER and ACRE, and observe that pretraining is essential for achieving compositional generalization in end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.
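The compress-then-reconstruct data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: single-head dot-product attention with random, untrained vectors stands in for the transformer, and the names `attend`, `compress_and_reconstruct`, `slot_queries`, `patch_queries`, and the sizes `T`, `N`, `K`, `D` are all hypothetical. For simplicity it holds out one frame and reconstructs its patch embeddings from the compressed tokens of the other frames only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention; keys double as values.

    Assumption: no learned projections, so this is only a stand-in
    for a real transformer layer.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def random_tokens(rng, n, dim):
    """n random vectors of size dim (placeholders for embeddings)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def mse(pred, target):
    """Mean squared error between two equally shaped token lists."""
    count = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / count


def compress_and_reconstruct(video, slot_queries, patch_queries, target_idx):
    # 1) Compress every context frame from N patch tokens to K slot tokens.
    context = []
    for t, frame in enumerate(video):
        if t != target_idx:
            context.extend(attend(slot_queries, frame))
    # 2) Predict the held-out frame's patches from the compressed context only.
    pred = attend(patch_queries, context)
    # 3) Reconstruction loss against the held-out frame's patch tokens.
    return context, pred, mse(pred, video[target_idx])


rng = random.Random(0)
T, N, K, D = 4, 16, 2, 8  # frames, patches per frame, slots per frame, dim
video = [random_tokens(rng, N, D) for _ in range(T)]
slot_queries = random_tokens(rng, K, D)
patch_queries = random_tokens(rng, N, D)

context, pred, loss = compress_and_reconstruct(
    video, slot_queries, patch_queries, target_idx=T - 1)
print(len(context), len(pred), len(pred[0]))
```

The point of the sketch is the bottleneck: the held-out frame is predicted from only `(T - 1) * K` compressed tokens rather than from `(T - 1) * N` patch tokens, so in a trained model a low reconstruction loss would require those few tokens to carry the scene content and its dynamics.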