Despite the recent emergence of video captioning models, generating vivid, fine-grained video descriptions grounded in background knowledge (i.e., long, informative commentary on domain-specific scenes with appropriate reasoning) remains far from solved, even though it has valuable applications such as automatic sports narration. In this paper, we present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, and use it to propose a challenging new task setting, Knowledge-grounded Video Captioning (KGVC). Moreover, we experimentally adapt existing methods to this task to show its difficulty and to identify potential directions for solving it. Our data and code are available at https://github.com/THU-KEG/goal.