Visual Question Answering (VQA) has become an active research area in recent years, with much of the effort aimed at improving accuracy through advanced models such as Transformers. Despite this interest, few studies have compared textual encoders within VQA or examined how their complexity affects performance. In this work, we conduct a comprehensive comparison between complex textual models that capture long-range dependencies and simpler models that focus on local textual features, all within a well-established VQA framework. Our findings show that a complex textual encoder is not always the best choice for the VQA-v2 dataset. Motivated by this insight, we introduce an improved model, ConvGRU, which adds convolutional layers to strengthen the representation of the question text. On the VQA-v2 dataset, ConvGRU achieves better performance without substantially increasing the parameter count.