Visual Question Answering (VQA) is a complex task requiring large datasets and expensive training. Neural Module Networks (NMN) first translate the question into a reasoning path, then follow that path to analyze the image and provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to ``warm start'' learning on the GQA dataset, then focus on Curriculum Learning (CL) as a way to improve training and make better use of the data. Several difficulty criteria are employed to define CL methods. We show that, with an appropriate choice of CL method, the cost of training and the amount of training data can be greatly reduced, with limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this allows the CL strategy to be simplified.
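As a rough illustration of the curriculum idea, the sketch below orders training questions from easy to hard and exposes them to the learner in growing pools. It is a minimal sketch under stated assumptions, not the method evaluated here: the abstract does not specify the difficulty criteria, so question length in words is used as a hypothetical stand-in, and the names `curriculum_stages` and `question_length` are invented for this example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int,
) -> List[List[T]]:
    """Return one training pool per stage, growing from easy to hard."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # sorted() is stable, so samples with equal difficulty keep their order.
    ordered = sorted(samples, key=difficulty)
    pools: List[List[T]] = []
    for stage in range(1, num_stages + 1):
        # Stage k sees the easiest k/num_stages fraction of the data
        # (at least one sample when the dataset is non-empty).
        cutoff = max(1, len(ordered) * stage // num_stages) if ordered else 0
        pools.append(list(ordered[:cutoff]))
    return pools


def question_length(question: str) -> int:
    # Hypothetical difficulty proxy (an assumption): number of words.
    return len(question.split())


# Toy questions, made up for illustration only.
questions = [
    "Is the cup to the left of the plate red?",
    "What color is the car?",
    "Is it sunny?",
    "What is the man holding?",
]
stages = curriculum_stages(questions, question_length, num_stages=2)
```

With two stages, the first pool holds the two shortest questions and the second pool holds all four. In practice a different difficulty function, for example one based on the length of the reasoning path, could be passed in without changing the scheduling logic.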