Despite the recent emergence of video captioning models, generating vivid, fine-grained video descriptions grounded in background knowledge (i.e., long and informative commentary on domain-specific scenes with appropriate reasoning) remains far from solved, even though it has valuable applications such as automatic sports narration. In this paper, we present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, which we use to propose a challenging new task setting, Knowledge-grounded Video Captioning (KGVC). Moreover, we experimentally adapt existing methods to demonstrate the difficulty of this task and to identify potential directions for solving it.