We introduce Affective Visual Dialog, an emotion explanation and reasoning task that serves as a testbed for research on understanding how emotions form in visually grounded conversations. The task involves three skills: (1) dialog-based question answering, (2) dialog-based emotion prediction, and (3) generation of affective explanations based on the dialog. Our key contribution is a large-scale dataset, dubbed AffectVisDial, consisting of 50K 10-turn visually grounded dialogs together with concluding emotion attributions and dialog-informed textual emotion explanations; collecting it required a total of 27,180 working hours. We explain our design decisions in collecting the dataset and introduce the questioner and answerer tasks associated with the two participants in the conversation. We train and demonstrate solid Affective Visual Dialog baselines adapted from state-of-the-art models. Remarkably, the responses generated by our models show promising emotional reasoning abilities in response to visually grounded conversations. Our project page is available at https://affective-visual-dialog.github.io.