To help visually impaired audiences enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots during the stretches when no actors are speaking. Existing works benchmark this challenge as a normal video captioning task by making simplifications, such as removing role names and evaluating narrations with n-gram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips in which no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. In addition, we propose a new metric, Movie Narration Score (MNScore), for evaluating movie narration, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task, which investigates clip localization given text descriptions. For both tasks, our proposed methods effectively leverage external knowledge and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.