Dense video captioning, which aims to automatically localize and caption all events in an untrimmed video, has attracted significant research attention. Several studies formulate dense video captioning as a multitask problem of event localization and event captioning in order to exploit inter-task relations. However, addressing both tasks using only visual input is challenging because visual features alone lack semantic content. In this study, we address this limitation by proposing a novel framework inspired by human cognitive information processing. Our model uses an external memory to incorporate prior knowledge, and retrieves from that memory through cross-modal video-to-text matching. To effectively incorporate the retrieved text features, we design a versatile encoder and a decoder with visual and textual cross-attention modules. Comparative experiments on the ActivityNet Captions and YouCook2 datasets demonstrate the effectiveness of the proposed method. The experimental results show that our model achieves promising performance without extensive pretraining on a large video dataset.
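The memory retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify the encoders, the similarity measure, or the memory layout, so everything here is an assumption for illustration: the toy three-dimensional embeddings, the cosine similarity score, and the hypothetical names `cosine`, `retrieve_top_k`, and `memory_bank`. It is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(video_feature, memory_bank, k=2):
    """Return the k memory sentences whose embeddings best match the video feature.

    memory_bank is a list of (sentence, embedding) pairs; in a real system both
    the video feature and the sentence embeddings would come from a shared
    cross-modal embedding space (assumed here, not taken from the source).
    """
    scored = [(cosine(video_feature, emb), text) for text, emb in memory_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy external memory: hypothetical sentences with made-up embeddings.
memory_bank = [
    ("a person chops vegetables", [0.9, 0.1, 0.0]),
    ("a man rides a bike", [0.0, 0.2, 0.9]),
    ("someone stirs a pot", [0.7, 0.6, 0.1]),
]

# Made-up feature for one video segment.
segment_feature = [1.0, 0.2, 0.0]
retrieved = retrieve_top_k(segment_feature, memory_bank, k=2)
print(retrieved)
```

In a pipeline of the kind the abstract outlines, the retrieved sentences would then be encoded as text features and passed, together with the visual features, to the cross-attention modules; that part is not sketched here.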