Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a great challenge in accurately locating the temporal locations of events, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm based on complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module produces differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that the captions generated from the positively and negatively masked videos are complementary, thereby forming a complete video description. In this way, even under weak supervision, event locations and event captions can be aligned implicitly. Extensive experiments on public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.
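The complementary-masking idea above can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the per-frame mask scores, feature shapes, and the `complementary_masks` helper are hypothetical, and the actual method learns the mask scores jointly with the captioning modules.

```python
import numpy as np

def complementary_masks(logits):
    """Map unconstrained per-frame scores to a positive mask and its complement.

    The sigmoid keeps the positive mask differentiable and in [0, 1];
    the negative mask is defined as its complement, so the two masked
    views together always cover the entire video.
    """
    pos = 1.0 / (1.0 + np.exp(-logits))  # soft, differentiable event mask
    neg = 1.0 - pos                      # complementary (non-event) mask
    return pos, neg

# Toy example: 6 frames with 4-dim features; the event lies roughly
# in the middle of the video (frames with large positive scores).
frames = np.random.randn(6, 4)                      # (T, feature_dim)
logits = np.array([-3.0, -1.0, 2.0, 3.0, 1.0, -2.0])
pos, neg = complementary_masks(logits)

pos_view = pos[:, None] * frames  # features a captioner would see for the event
neg_view = neg[:, None] * frames  # features for the complementary (rest) view

# The two views sum back to the full video, which is what lets captions
# from the two views be trained to jointly describe the whole video.
assert np.allclose(pos_view + neg_view, frames)
```

Because the masks are soft and sum to one at every frame, gradients from a captioning loss on either masked view flow back into the mask scores, which is what allows event localization to be learned without boundary annotations.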