Visual question answering (VQA) has been intensively studied as a multimodal task that requires bridging vision and language to infer answers correctly. Recent work has developed various attention-based modules for VQA. However, inference performance is largely bottlenecked by the visual processing needed for semantic understanding. Most existing detection methods rely on bounding boxes, which makes it difficult for VQA models to understand the causal nexus of object semantics in images and to infer contextual information correctly. To this end, we propose a finer-grained model framework without bounding boxes, termed Looking Out of Instance Semantics (LOIS). LOIS enables more fine-grained feature descriptions to produce visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, we devise two types of relation attention modules, 1) intra-modality and 2) inter-modality, to infer the correct answers from the different multi-view features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on the important words in the question. Experimental results on four benchmark VQA datasets show that our proposed method performs favorably in improving visual reasoning capability.
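As a rough illustration of the distinction between the two relation attention types named above, the following minimal sketch uses standard scaled dot-product attention. This is an assumption made for illustration only: the abstract does not specify the attention formulation, and this is not the LOIS implementation. The function names (`attend`, `intra_modality`, `inter_modality`), the feature dimension, and the random inputs are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights)."""
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


def intra_modality(features):
    """Each element attends to all elements of the same modality."""
    return attend(features, features, features)


def inter_modality(visual, words):
    """Visual features act as queries; question words act as keys and values."""
    return attend(visual, words, words)


random.seed(0)
dim = 4
# Hypothetical inputs: 3 instance features plus 1 background feature, and 5 word embeddings.
visual = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]
words = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

# Intra-modality: relate instance and background features to one another.
visual_ctx, intra_w = intra_modality(visual)
# Inter-modality: weight question words for each visual feature.
fused, inter_w = inter_modality(visual_ctx, words)
print(len(fused), len(fused[0]))
```

In this sketch, each row of `intra_w` describes how strongly one visual feature relates to the others (including the background), while each row of `inter_w` describes which question words a visual feature attends to. A real model would add learned projections and multiple heads, which are omitted here.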