Detection-based Intermediate Supervision for Visual Question Answering

Neural module networks (NMNs) have recently proven effective at answering compositional visual questions, especially those that require multi-hop visual and logical reasoning. NMNs decompose a complex question into several sub-tasks using instance-modules drawn from the question's reasoning path, and then exploit intermediate supervision to guide answer prediction, which improves the interpretability of inference. However, their performance may be hindered by coarse modeling of that intermediate supervision. For instance: (1) they assume a priori that each instance-module refers to only one grounded object, overlooking other potentially associated grounded objects and impeding full cross-modal alignment learning; (2) IoU-based intermediate supervision may introduce noisy signals, because overlapping bounding boxes can steer the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to support multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervision, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances consistency between the answers to compositional questions and the answers to their sub-questions. Extensive experiments demonstrate the superiority of the proposed DIS, showing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
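
The two issues named in the abstract can be made concrete with a small sketch. The abstract does not describe how DIS encodes boxes, so everything below is an assumption for illustration only: the `iou` helper, the hypothetical `boxes_to_tokens` function, the 1000-bin coordinate quantization (in the spirit of coordinate-token approaches to generative detection), and the end-of-sequence id are not taken from the paper. The first function shows how a box for a different, overlapping object can still score a high IoU against the target; the second shows how several grounded boxes could be flattened into one target sequence rather than a single box.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boxes_to_tokens(boxes: List[Box], width: int, height: int,
                    bins: int = 1000) -> List[int]:
    """Quantize every box into 4 coordinate tokens and concatenate them.

    Hypothetical encoding: coordinates map to ids 0..bins-1, and the id
    `bins` is used as an assumed end-of-sequence marker.
    """
    tokens: List[int] = []
    for x1, y1, x2, y2 in boxes:
        for value, size in ((x1, width), (y1, height), (x2, width), (y2, height)):
            tokens.append(min(bins - 1, max(0, int(value / size * bins))))
    tokens.append(bins)  # assumed end-of-sequence id
    return tokens


# A target box and a box for a different object that largely overlaps it:
# an IoU threshold alone cannot tell the two apart.
target = (100, 50, 200, 350)
overlapping_other = (100, 60, 200, 340)
overlap_score = iou(target, overlapping_other)

# Two grounded objects for one reasoning step, serialized as one sequence.
sequence = boxes_to_tokens([target, (250, 100, 350, 300)], width=400, height=400)
```

Under these assumptions, `overlap_score` is well above a typical 0.5 matching threshold even though the second box belongs to another object, while `sequence` carries both grounded boxes in a single supervision target.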

