The visual question generation (VQG) task aims to generate human-like questions from an image and potentially other side information (e.g., answer type). Previous works on VQG fall short in two aspects: i) they suffer from the one-to-many mapping problem between an image and its questions, which leads to a failure to generate referential and meaningful questions from an image; ii) they fail to model the complex implicit relations among the visual objects in an image and also overlook potential interactions between the side information and the image. To address these limitations, we first propose a novel learning paradigm that generates visual questions with answer-awareness and region-reference. Concretely, we aim to ask the right visual questions with Double Hints - textual answers and visual regions of interest - which effectively mitigates the existing one-to-many mapping issue. In particular, we develop a simple method to self-learn the visual hints without introducing any additional human annotations. Furthermore, to capture these sophisticated relationships, we propose a new double-hints guided Graph-to-Sequence learning framework, which first models them as a dynamic graph and learns the implicit topology end-to-end, and then utilizes a graph-to-sequence model to generate the questions with double hints. Experimental results demonstrate the superiority of our proposed method.
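To make the dynamic-graph step concrete, below is a minimal PyTorch sketch of one way such a module could look: detected region features are treated as graph nodes, a soft adjacency matrix is learned end-to-end from pairwise attention scores (the "implicit topology"), and one round of message passing produces relation-aware node features for a downstream sequence decoder. This is an illustrative assumption, not the authors' implementation; the names (DynamicGraphEncoder, obj_feats) and the shapes (36 regions, 512-d features) are hypothetical.

```python
# Minimal sketch of dynamic graph learning over visual objects.
# Assumption-based illustration; not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphEncoder(nn.Module):
    """Learns a soft adjacency over visual objects end-to-end,
    then aggregates neighbor features with one graph layer."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, dim) region features,
        # e.g. pooled from an object detector.
        q, k = self.query(obj_feats), self.key(obj_feats)
        # Pairwise scores define the implicit topology; softmax yields
        # a dense, differentiable adjacency learned with the rest of
        # the model (no fixed, hand-built graph).
        adj = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        # One round of message passing over the learned graph.
        return F.relu(adj @ self.value(obj_feats))

# Usage: a batch of 2 images, 36 regions each, 512-d features.
enc = DynamicGraphEncoder(dim=512)
ctx = enc(torch.randn(2, 36, 512))
print(ctx.shape)  # torch.Size([2, 36, 512]) relation-aware features
```

In the full framework these relation-aware features, together with the double hints (answer embedding and predicted regions of interest), would condition a graph-to-sequence decoder that emits the question tokens.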