Large language models such as GPT-3 have demonstrated an impressive ability to adapt to new tasks without task-specific training data. This ability has been particularly effective in settings such as narrative question answering, where the diversity of tasks is immense but supervision data is scarce. In this work, we investigate whether such language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content such as drama, movies, and animation, where the story plays an essential role. We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video into a short plot and then searches for the parts of the video relevant to the question. We also propose to enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos.
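The summarize-then-search idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `long_story_short_sketch`, the prompt wording, the `top_k` parameter, the final answering step, and the stand-in `fake_llm` are all assumptions made for illustration, and clips are represented as plain text descriptions rather than video. CLIPCheck is not modeled here.

```python
import re
from typing import Callable, Dict, List


def long_story_short_sketch(
    clips: List[str],
    question: str,
    llm: Callable[[str], str],
    top_k: int = 1,
) -> Dict[str, object]:
    """Hypothetical two-stage pipeline: summarize to a plot, then search it."""
    # Stage 1: compress each clip description into one plot sentence.
    plot = [llm("Summarize in one sentence: " + clip) for clip in clips]

    # Stage 2: ask the model which plot lines are relevant to the question.
    numbered = "\n".join(f"({i}) {line}" for i, line in enumerate(plot))
    reply = llm(
        f"Plot:\n{numbered}\nQuestion: {question}\n"
        "List the indices of the most relevant plot lines."
    )
    picked: List[int] = []
    for token in re.findall(r"\d+", reply):
        index = int(token)
        if index < len(plot) and index not in picked:
            picked.append(index)
    picked = picked[:top_k]

    # Assumed final step: answer from the plot plus the retrieved clips.
    context = "\n".join(clips[i] for i in picked)
    answer = llm(
        f"Plot:\n{numbered}\nRelevant clips:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
    return {"plot": plot, "picked": picked, "answer": answer}


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real language model (assumption)."""
    if prompt.startswith("Summarize"):
        # Keep only the first sentence of the clip description.
        return prompt.split(": ", 1)[1].split(".")[0] + "."
    if "List the indices" in prompt:
        return "1"
    return "a ring"


clips = ["Anna finds a map. She hides it.", "Ben steals a ring. He runs."]
result = long_story_short_sketch(clips, "What does Ben steal?", fake_llm)
```

In practice `llm` would wrap a call to an actual language model, and the clip descriptions would come from the video's subtitles or captions; both are left abstract here.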