Existing large vision-language models (LVLMs) are largely limited to processing short, seconds-long videos and struggle to generate coherent descriptions for videos spanning minutes or more. Long video description introduces new challenges, such as maintaining plot-level consistency across descriptions. To address these, we identify audio-visual character identification, i.e., matching character names to each line of dialogue, as a key factor. We propose StoryTeller, a system for generating dense descriptions of long videos that incorporates both low-level visual concepts and high-level plot information. StoryTeller uses a multimodal large language model that integrates visual, audio, and text modalities to perform audio-visual character identification on minute-long video clips. The results are then fed into an LVLM to enhance the consistency of the video descriptions. We validate our approach on movie description tasks and introduce MovieStory101, a dataset of dense descriptions for three-minute movie clips. To evaluate long video descriptions, we create MovieQA, a large set of multiple-choice questions for the MovieStory101 test set. We assess each description by feeding it to GPT-4 to answer these questions, using accuracy as an automatic evaluation metric. Experiments show that StoryTeller outperforms all open- and closed-source baselines on MovieQA, achieving 9.5% higher accuracy than the strongest baseline, Gemini-1.5-pro, and demonstrating a +15.56% advantage in human side-by-side evaluations. Additionally, incorporating StoryTeller's audio-visual character identification improves the performance of all video description models, with Gemini-1.5-pro and GPT-4o showing relative accuracy improvements of 5.5% and 13.0%, respectively, on MovieQA.
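The QA-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ask_llm` is a hypothetical stand-in for a GPT-4 API call, and the question-dictionary format is assumed for illustration.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder for a GPT-4 call; in practice this would
    query the API and return a predicted option letter such as 'A'."""
    raise NotImplementedError

def movieqa_accuracy(description, questions, answer_fn=None):
    """Score a generated video description by multiple-choice QA accuracy.

    description: the generated description of the video clip.
    questions:   list of dicts with 'question', 'options' (list of str),
                 and 'answer' (gold option letter) -- assumed format.
    answer_fn:   callable mapping a prompt to a predicted option letter.
    """
    answer_fn = answer_fn or ask_llm
    correct = 0
    for q in questions:
        # Render options as "A. ...", "B. ...", etc.
        opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(q["options"]))
        prompt = (
            f"Video description:\n{description}\n\n"
            f"Question: {q['question']}\n{opts}\n"
            "Answer with the option letter only."
        )
        # The answering model sees only the description, not the video,
        # so accuracy reflects the description's informativeness.
        if answer_fn(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```

Because the answering model never sees the video itself, a description only scores well if it actually contains the plot-level details the questions probe.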