Popular video captioning benchmarks and models deal with generic captions that lack specific named entities such as people, places, or organizations. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. We therefore propose the task of summarizing news videos directly into entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to an existing news image captioning dataset. Backed by extensive experiments and insights, we believe this work establishes a solid basis for future research on this challenging task.