Large-scale pre-trained models (PTMs) show strong zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by three observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in a VQA reasoning chain require different skills, such as object detection and relational reasoning, but a single PTM may not possess all of these skills. Third, recent work on zero-shot VQA does not explicitly consider multi-step reasoning chains, which makes it less interpretable than a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert each sub-reasoning task into an objective that PTMs accept and assign it to a suitable PTM without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and its better interpretability compared with several baselines.
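The idea of decomposing a question into sub-reasoning steps and routing each step to a module with the matching skill can be illustrated with a minimal sketch. This is not the network proposed in the paper: the module names (`detect`, `relate`, `query`), the `Step` type, the symbolic scene that stands in for an image, and the hand-written decomposition are all assumptions made for illustration. In a real system each stub would presumably wrap a pre-trained model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Toy stand-in for an image: a list of objects with attributes.
# (Assumption for illustration; a real system would operate on pixels.)
Scene = List[Dict[str, str]]


@dataclass
class Step:
    skill: str     # name of the module that should handle this step
    argument: str  # step-specific argument (object name, relation, attribute)


def detect(scene: Scene, state: Any, arg: str) -> Scene:
    # Hypothetical stand-in for an object-detection PTM:
    # select the objects whose name matches `arg`.
    return [obj for obj in scene if obj["name"] == arg]


def relate(scene: Scene, state: Scene, arg: str) -> Scene:
    # Hypothetical stand-in for relational reasoning: select the objects
    # that stand in relation `arg` to any currently selected object.
    anchors = {obj["name"] for obj in state}
    return [obj for obj in scene if obj.get(arg) in anchors]


def query(scene: Scene, state: Scene, arg: str) -> str:
    # Hypothetical stand-in for attribute recognition:
    # read attribute `arg` from the first selected object.
    return state[0][arg] if state else "unknown"


# Routing table: each sub-reasoning skill is assigned to one module.
MODULES: Dict[str, Callable[[Scene, Any, str], Any]] = {
    "detect": detect,
    "relate": relate,
    "query": query,
}


def run(scene: Scene, steps: List[Step]) -> Tuple[Any, List[Tuple[str, str, Any]]]:
    # Execute the steps in order, passing each result to the next step,
    # and record every intermediate result so the chain can be inspected.
    state: Any = None
    trace = []
    for step in steps:
        state = MODULES[step.skill](scene, state, step.argument)
        trace.append((step.skill, step.argument, state))
    return state, trace


scene = [
    {"name": "table", "color": "brown"},
    {"name": "cup", "color": "red", "on": "table"},
    {"name": "cup", "color": "blue", "on": "shelf"},
    {"name": "shelf", "color": "white"},
]

# Hand-written decomposition of "What color is the object on the table?"
steps = [Step("detect", "table"), Step("relate", "on"), Step("query", "color")]

answer, trace = run(scene, steps)
print(answer)
for skill, argument, result in trace:
    print(skill, argument, result)
```

The recorded trace is what makes such a pipeline inspectable: each intermediate result can be checked against the step that produced it, which is the sense in which a decomposition-based approach is easier to interpret than a single end-to-end prediction.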