Video-grounded dialogue understanding is a challenging problem that requires machines to perceive, parse, and reason over situated semantics extracted from weakly aligned video and dialogue. Most existing benchmarks treat the problem as a frame-independent visual understanding task, handling both modalities in the same way and neglecting intrinsic attributes of multimodal dialogue such as scene and topic transitions. In this paper, we present the Video-grounded Scene&Topic AwaRe dialogue (VSTAR) dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series. Based on VSTAR, we propose two benchmarks for video-grounded dialogue understanding, scene segmentation and topic segmentation, and one benchmark for video-grounded dialogue generation. We perform comprehensive experiments on these benchmarks to demonstrate the importance of multimodal information and segment information in video-grounded dialogue understanding and generation.