In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides richly informative answers in each round, including external knowledge related to the visual content. Unlike existing datasets, where answers are compact and short, InfoVisDial contains long, free-form answers in each round of dialogue. For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3). GIT can describe the image content, including scene text, while GPT-3 can generate informative dialogue from the image description given appropriate prompting techniques. With such an automatic pipeline, we can readily generate informative visual dialogue data at scale. We then ask human annotators to rate the generated dialogues and filter out low-quality conversations. Human analyses show that InfoVisDial covers informative and diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image scene text, and $36.7\%$ require external knowledge. Each round's answer is also long and open-ended: $87.3\%$ of answers are unique, with an average length of $8.9$ words, compared with $27.37\%$ and $2.9$ words in VisDial. Finally, we propose a strong baseline by adapting the GIT model to the visual dialogue task and fine-tuning it on InfoVisDial. We hope our work motivates more effort in this direction.
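The collection pipeline and the answer statistics described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `caption_fn`, `dialogue_fn`, and `rate_fn` are hypothetical stand-ins for the captioning model (e.g., GIT), the language model (e.g., GPT-3), and the human rating step, and the prompt wording, the `min_rating` threshold, and `num_rounds` are assumptions. The sketch also assumes that "unique" means distinct answer strings (case-insensitive) divided by total answers, and that answer length is counted in whitespace-separated words.

```python
from typing import Callable, Dict, List

# One dialogue is a list of rounds; each round holds a question and an answer.
Dialogue = List[Dict[str, str]]


def build_prompt(caption: str, num_rounds: int) -> str:
    """Wrap an image description in an instruction for the language model.

    The wording here is an assumption; the abstract does not give the prompt.
    """
    return (
        f"Image description: {caption}\n"
        f"Write {num_rounds} rounds of question and answer about this image. "
        "Answers should be long and informative and may use external knowledge."
    )


def generate_dialogues(
    images: List[str],
    caption_fn: Callable[[str], str],        # hypothetical: image -> description
    dialogue_fn: Callable[[str], Dialogue],  # hypothetical: prompt -> dialogue
    rate_fn: Callable[[Dialogue], int],      # hypothetical: human rating
    min_rating: int = 3,                     # assumed threshold
    num_rounds: int = 5,                     # assumed number of rounds
) -> List[Dialogue]:
    """Caption each image, generate a dialogue, keep only well-rated ones."""
    kept = []
    for image in images:
        caption = caption_fn(image)
        dialogue = dialogue_fn(build_prompt(caption, num_rounds))
        if rate_fn(dialogue) >= min_rating:
            kept.append(dialogue)
    return kept


def answer_stats(dialogues: List[Dialogue]) -> Dict[str, float]:
    """Percentage of distinct answers and mean answer length in words."""
    answers = [r["answer"] for d in dialogues for r in d]
    if not answers:
        return {"unique_pct": 0.0, "avg_len": 0.0}
    distinct = {a.strip().lower() for a in answers}
    total_words = sum(len(a.split()) for a in answers)
    return {
        "unique_pct": 100.0 * len(distinct) / len(answers),
        "avg_len": total_words / len(answers),
    }
```

Passing the models and the rating step in as callables keeps the sketch independent of any particular model API; `answer_stats` can be run on the filtered dialogues to obtain figures comparable in kind to the uniqueness and length numbers quoted above, under the stated assumptions.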