Text-based VQA aims to answer questions by reading the text present in images. Compared to the general VQA task, it requires a much deeper understanding of scene-text relationships. Recent studies have shown that the question-answer pairs in the TextVQA dataset focus heavily on the text present in the image while giving less importance to visual features, and that some questions can be answered without understanding the image at all. Models trained on this dataset therefore predict biased answers because they lack an understanding of the visual context. For example, for questions like "What is written on the signboard?", the model always predicts "STOP", effectively ignoring the image. To address these issues, we propose a method that learns visual features (making V matter in TextVQA) along with OCR features and question features, using the VQA dataset as external knowledge for text-based VQA. Specifically, we combine the TextVQA and VQA datasets and train the model on the combined dataset. This simple yet effective approach strengthens the correlation between image features and the text present in the image, which helps the model answer questions better. We further test the model on different datasets and compare the qualitative and quantitative results.
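The dataset-combination step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Sample` record, its field names, and the `combine_datasets` helper are hypothetical, and it assumes that VQA samples simply carry an empty OCR token list, which the abstract does not specify.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Hypothetical record for one question-answer pair (field names are assumptions)."""
    image_id: str
    question: str
    answer: str
    ocr_tokens: List[str] = field(default_factory=list)  # assumed empty for VQA samples
    source: str = ""  # filled in by combine_datasets


def combine_datasets(textvqa: List[Sample], vqa: List[Sample], seed: int = 0) -> List[Sample]:
    """Concatenate both datasets, tag each sample with its origin, and shuffle."""
    combined = [
        Sample(s.image_id, s.question, s.answer, list(s.ocr_tokens), "textvqa")
        for s in textvqa
    ]
    combined += [
        Sample(s.image_id, s.question, s.answer, list(s.ocr_tokens), "vqa")
        for s in vqa
    ]
    # Shuffle with a fixed seed so training batches mix both sources reproducibly.
    random.Random(seed).shuffle(combined)
    return combined


# Toy data standing in for the real datasets.
textvqa = [Sample("t1", "What is written on the signboard?", "stop", ["STOP"])]
vqa = [Sample("v1", "What color is the signboard?", "red")]

train_set = combine_datasets(textvqa, vqa)
```

Tagging each sample with its source keeps per-dataset evaluation possible after the merge; whether and how the two sources should be balanced during training is left open here.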