Growing interest in conversational agents has promoted two-way human-computer communication, and asking and answering visual questions has become an active area of research in AI. Generating visual question-answer pairs is therefore an important and challenging task. To address it, we propose a weakly supervised visual question-answer generation method that generates relevant question-answer pairs for a given input image and its associated caption. Most prior work is supervised and depends on annotated question-answer datasets. In contrast, our method synthesizes question-answer pairs procedurally from visual information and captions. The method first extracts a list of answer words, then performs nearest question generation, which uses the caption and an answer word to generate a synthetic question. Next, the relevant question generator converts the nearest question into a relevant natural-language question by dependency parsing and in-order tree traversal. Finally, we fine-tune a ViLBERT model on the generated question-answer pairs. We perform an exhaustive experimental analysis on the VQA dataset and find that our model significantly outperforms SOTA methods on BLEU scores. We also report results with respect to baseline models and an ablation study.
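As a rough illustration of the in-order traversal step, the following minimal sketch linearizes a dependency tree into a word sequence by visiting each head's left dependents, then the head, then its right dependents. The `Node` class, the `in_order` function, and the hand-built example parse are all hypothetical and introduced only for illustration; the abstract does not specify how the dependency tree is constructed or rearranged before traversal, so this should not be read as the method's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical dependency-tree node: a word, its token position, and its dependents."""
    word: str
    index: int
    children: List["Node"] = field(default_factory=list)


def in_order(node: Node) -> List[str]:
    """Visit left dependents, then the head, then right dependents (by token position)."""
    left = sorted((c for c in node.children if c.index < node.index), key=lambda c: c.index)
    right = sorted((c for c in node.children if c.index > node.index), key=lambda c: c.index)
    words: List[str] = []
    for child in left:
        words.extend(in_order(child))
    words.append(node.word)
    for child in right:
        words.extend(in_order(child))
    return words


# Assumed (hand-written) parse of a question; a real system would obtain it from a parser.
man = Node("man", 3, [Node("the", 2)])
root = Node("riding", 4, [Node("what", 0), Node("is", 1), man])

print(" ".join(in_order(root)))
```

For a projective tree, this traversal reproduces the token order encoded in the node positions, so any reordering of the question would come from how the tree is built or modified before the traversal.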