To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots during stretches where no actors are speaking. Existing works benchmark this challenge as an ordinary video captioning task under several simplifications, such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips in which no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. In addition, we propose a new metric for evaluating movie narration, the Movie Narration Score (MNScore), which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task, which investigates clip localization given text descriptions. For both tasks, our proposed methods make effective use of external knowledge and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.