Transformers achieve strong performance on Visual Question Answering (VQA). However, their systematic generalization capabilities, i.e., their ability to handle novel combinations of known concepts, remain unclear. We reveal that Neural Module Networks (NMNs), i.e., question-specific compositions of modules that each tackle a sub-task, achieve systematic generalization performance that is better than or similar to that of conventional Transformers, even though NMNs' modules are CNN-based. To address this shortcoming of Transformers relative to NMNs, we investigate whether and how modularity can benefit Transformers. Specifically, we introduce the Transformer Module Network (TMN), a novel NMN based on compositions of Transformer modules. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, improving by more than 30% over standard Transformers on novel compositions of sub-tasks. We show that not only module composition but also module specialization for each sub-task is key to this performance gain.