Neural Module Networks (NMNs) are a compelling method for visual question answering: they translate a question into a program consisting of a series of reasoning sub-tasks that are executed sequentially on the image to produce an answer. NMNs offer better explainability than integrated models, making the underlying reasoning process easier to understand. To improve the effectiveness of NMNs, we propose to exploit features obtained from a large-scale cross-modal encoder. In addition, the current training approach for NMNs propagates each module's output to the subsequent modules, so prediction errors accumulate along the program and lead to incorrect answers. To mitigate this, we introduce an NMN learning strategy based on scheduled teacher guidance. Initially, the model is fully guided by the ground-truth intermediate outputs; as training progresses, it gradually transitions to autonomous behavior, relying on its own intermediate predictions. This reduces error accumulation, improving both training efficiency and final performance. We demonstrate that by incorporating cross-modal features and employing more effective training techniques for NMNs, we achieve a favorable balance between performance and transparency of the reasoning process.
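The scheduled teacher guidance described in the abstract can be illustrated with a minimal sketch. The abstract does not give the schedule or the module interface, so everything below is an assumption made for illustration: the linear decay in `teacher_guidance_ratio`, the per-module random choice in `choose_module_input`, and the toy callables standing in for neural modules are all hypothetical names and design choices, not the authors' implementation.

```python
import random


def teacher_guidance_ratio(step, total_steps):
    """Probability of feeding the ground-truth intermediate output.

    Assumption: a linear decay from 1.0 (fully guided) to 0.0 (autonomous).
    The abstract only says the transition is gradual.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def choose_module_input(predicted, ground_truth, ratio, rng):
    """Pick what the next module receives: ground truth with probability `ratio`."""
    return ground_truth if rng.random() < ratio else predicted


def run_program(modules, initial_input, gt_intermediates, ratio, rng):
    """Execute a program (a list of modules) with scheduled teacher guidance.

    Each module's own prediction is recorded (so a loss could be computed on it),
    but the input passed to the next module may be replaced by the ground truth.
    """
    current = initial_input
    outputs = []
    for module, ground_truth in zip(modules, gt_intermediates):
        predicted = module(current)
        outputs.append(predicted)
        current = choose_module_input(predicted, ground_truth, ratio, rng)
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for neural modules (hypothetical, for illustration only).
    modules = [lambda x: x + 1, lambda x: x * 2]
    gt_intermediates = [10, 20]
    total_steps = 4
    for step in range(total_steps + 1):
        ratio = teacher_guidance_ratio(step, total_steps)
        outputs = run_program(modules, 0, gt_intermediates, ratio, rng)
        print(f"step={step} ratio={ratio:.2f} outputs={outputs}")
```

With a ratio of 1.0 every module receives the ground-truth output of its predecessor, so an early mistake cannot propagate; with a ratio of 0.0 the program runs purely on its own predictions, as it would at inference time.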