Recently, neural module networks (NMNs) have achieved sustained success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose a complex question into several sub-tasks using instance-modules derived from the question's reasoning path, and then exploit intermediate supervision to guide answer prediction, thereby improving the interpretability of inference. However, their performance may be hindered by coarse modeling of this intermediate supervision. For instance, (1) existing methods assume that each instance-module refers to only one grounded object and overlook other potentially associated grounded objects, which impedes full cross-modal alignment learning; (2) IoU-based intermediate supervision may introduce noisy signals, because overlapping bounding boxes can steer the model's focus towards irrelevant objects. To address these issues, we propose a novel method, \textbf{\underline{D}}etection-based \textbf{\underline{I}}ntermediate \textbf{\underline{S}}upervision (DIS), which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervision, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances consistency between the answers to compositional questions and the answers to their sub-questions. Extensive experiments demonstrate the superiority of DIS, showing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
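To make issue (2) concrete, the following is a minimal sketch of how IoU-based matching can mark an irrelevant object as a positive when bounding boxes overlap. It is an illustration only, not the authors' implementation: the `(x1, y1, x2, y2)` box format, the scene, the object names, and the 0.5 threshold are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical scene: the module should ground "jacket", but the wearer's
# box almost entirely contains the jacket's box.
target_jacket: Box = (10.0, 20.0, 50.0, 80.0)
candidates: Dict[str, Box] = {
    "jacket": (12.0, 22.0, 50.0, 78.0),
    "person": (8.0, 10.0, 52.0, 90.0),
    "dog": (70.0, 60.0, 95.0, 90.0),
}

scores = {name: iou(target_jacket, box) for name, box in candidates.items()}

# Assumed threshold: candidates at or above it are treated as positives.
threshold = 0.5
positives: List[str] = [n for n, s in scores.items() if s >= threshold]

for name, score in scores.items():
    print(f"{name}: IoU = {score:.3f}")
print("positives:", positives)
```

In this toy scene the "person" box clears the threshold alongside the "jacket" box, so an IoU-derived label would also reward attention on the wrong object. This is the kind of noisy signal the abstract attributes to bounding box overlap.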