Automatic movie narration aims to create video-aligned plot descriptions to assist visually impaired audiences. It differs from standard video captioning in that it requires not only describing key visual details but also inferring plots that develop across multiple movie shots, which poses unique and ongoing challenges. To advance the development of automatic movie narration systems, we first revisit the limitations of existing datasets and develop a large-scale, bilingual movie narration dataset, Movie101v2. Second, taking into account the essential difficulties in achieving applicable movie narration, we break the long-term goal into three progressive stages and tentatively focus on the initial stages, which feature understanding within individual clips. We also introduce a new narration assessment that aligns with our staged task goals. Third, using our new dataset, we baseline several leading large vision-language models, including GPT-4V, and conduct in-depth investigations into the challenges current models face in movie narration generation. Our findings reveal that achieving applicable movie narration generation is a fascinating goal that requires thorough research.