Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle to generate coherent descriptions for videos spanning minutes or more. Long video description introduces new challenges, such as maintaining plot-level consistency across descriptions. To address these, we identify audio-visual character identification, i.e., matching character names to each line of dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos that incorporates both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of the video descriptions. We validate our approach on movie description tasks and introduce MovieStory101, a dataset of dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create MovieQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess each description by feeding it to GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open- and closed-source baselines on MovieQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating StoryTeller's audio-visual character identification improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative accuracy improvements of 5.5% and 13.0%, respectively, on MovieQA.
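The QA-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a hypothetical stand-in for a GPT-4 API call, and the question-dictionary format is assumed for illustration.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a GPT-4 call; in practice this would
    query the API and return a predicted option letter such as 'A'."""
    raise NotImplementedError

def movieqa_accuracy(description, questions, answer_fn=None):
    """Score a generated video description by multiple-choice QA accuracy.

    description: the generated description of the video clip.
    questions:   list of dicts with 'question', 'options' (list of str),
                 and 'answer' (gold option letter) -- assumed format.
    answer_fn:   callable mapping a prompt to a predicted option letter.
    """
    answer_fn = answer_fn or ask_llm
    correct = 0
    for q in questions:
        # Render options as "A. ...", "B. ...", etc.
        opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(q["options"]))
        prompt = (
            f"Video description:\n{description}\n\n"
            f"Question: {q['question']}\n{opts}\n"
            "Answer with the option letter only."
        )
        # The answering model sees only the description, not the video,
        # so accuracy reflects the description's informativeness.
        if answer_fn(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```

Because the answering model never sees the video itself, a description only scores well if it actually contains the plot-level details the questions probe.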