Despite recent progress in video captioning, generating text descriptions that contain specific entity names and fine-grained actions remains far from solved, even though it has valuable applications such as live text broadcasting of basketball games. In this paper, we propose a new multimodal, knowledge-supported basketball benchmark for video captioning. Specifically, we construct a Multimodal Basketball Game Knowledge Graph (MbgKG) to provide knowledge beyond the videos. Based on MbgKG, we then construct a Multimodal Basketball Game Video Captioning (MbgVC) dataset that contains 9 types of fine-grained shooting events and knowledge (i.e., images and names) of 286 players. We develop a novel encoder-decoder framework, named Entity-Aware Captioner (EAC), for basketball live text broadcast. A bi-directional GRU (Bi-GRU) module encodes the temporal information in the video, and a multi-head self-attention module models the relationships among players and selects the key players. In addition, we propose a new evaluation metric, Game Description Score (GDS), which measures not only linguistic quality but also the accuracy of name prediction. Extensive experiments on the MbgVC dataset demonstrate that EAC effectively leverages external knowledge and outperforms advanced video captioning models. The proposed benchmark and corresponding code will be made publicly available soon.
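To make the player-relationship step more concrete, the following is a minimal, dependency-free sketch of multi-head scaled dot-product self-attention over per-player feature vectors. It is not the EAC implementation: the identity query/key/value projections, the splitting of the feature dimension into heads, and the `key_players` rule (ranking players by the total attention they receive) are all assumptions made for illustration, as are the function names and the placeholder player labels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(feats):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the raw features
    (identity projections); a trained model would learn these.
    """
    dim = len(feats[0])
    weights = []
    for query in feats:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in feats
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(row, feats)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


def multi_head_self_attention(feats, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.

    Returns the concatenated outputs and the head-averaged attention weights.
    """
    n, dim = len(feats), len(feats[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs = [[] for _ in range(n)]
    avg_weights = [[0.0] * n for _ in range(n)]
    for h in range(num_heads):
        chunk = [f[h * head_dim:(h + 1) * head_dim] for f in feats]
        head_out, head_w = self_attention(chunk)
        for i in range(n):
            outputs[i].extend(head_out[i])
            for j in range(n):
                avg_weights[i][j] += head_w[i][j] / num_heads
    return outputs, avg_weights


def key_players(names, avg_weights, top_k):
    """Hypothetical selection rule: rank players by attention received."""
    received = [sum(row[j] for row in avg_weights) for j in range(len(names))]
    ranked = sorted(range(len(names)), key=lambda j: received[j], reverse=True)
    return [names[j] for j in ranked[:top_k]]


if __name__ == "__main__":
    # Placeholder labels and toy features, not data from MbgVC.
    names = ["player_a", "player_b", "player_c"]
    feats = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
    outputs, weights = multi_head_self_attention(feats, num_heads=2)
    print(key_players(names, weights, top_k=2))
```

Each row of the averaged weight matrix sums to 1, so a column sum can be read as how much the other players attend to a given player. That is one plausible proxy for a key player, though the abstract does not say how EAC makes this selection.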