While image understanding on recognition-level has achieved remarkable advancements, reliable visual scene understanding requires comprehensive image understanding on recognition-level but also cognition-level, which calls for exploiting the multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query and response. Extensive experiments on large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
翻译:尽管在识别层面的图像理解已取得显著进展,但可靠的视觉场景理解不仅需要识别层面的全面图像理解,还需要认知层面的理解,这要求利用多源信息以及学习不同层次的理解和广泛的常识知识。本文提出一种新型的认知注意力网络(Cognitive Attention Network, CAN)用于视觉常识推理,以实现可解释的视觉理解。具体而言,我们首先引入一个图像-文本融合模块,用于共同融合图像和文本信息。其次,设计了一种新型推理模块,用于对图像、查询和响应之间的常识进行编码。在大型视觉常识推理(Visual Commonsense Reasoning, VCR)基准数据集上的大量实验证明了我们方法的有效性。该实现已在 https://github.com/tanjatang/CAN 公开提供。