Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using models trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing methods focus merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of annotations in existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse-annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
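To make the similarity-aware objective concrete, the following is a minimal PyTorch sketch of how Gaussian temporal masks could be steered toward caption-aligned frames. The function names, the cosine-similarity choice, and the mask-weighted loss form are illustrative assumptions, not the exact SAIL implementation.

```python
# Minimal sketch of a similarity-aware Gaussian masking objective (PyTorch).
# All names (gaussian_masks, similarity_alignment_loss, centers, widths) are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def gaussian_masks(centers, widths, num_frames):
    # centers, widths: (K,) event parameters in [0, 1]; returns (K, T)
    # soft temporal masks, one Gaussian per event.
    t = torch.linspace(0, 1, num_frames)                   # (T,)
    return torch.exp(-(t[None, :] - centers[:, None]) ** 2
                     / (2 * widths[:, None] ** 2 + 1e-8))  # (K, T)

def similarity_alignment_loss(frame_feats, caption_feats, centers, widths):
    # frame_feats: (T, D) frame embeddings; caption_feats: (K, D) event
    # caption embeddings, both assumed to live in a shared cross-modal space.
    sim = (F.normalize(caption_feats, dim=-1)
           @ F.normalize(frame_feats, dim=-1).T)           # (K, T) cosine similarity
    masks = gaussian_masks(centers, widths, frame_feats.shape[0])
    # Mask-weighted mean similarity: high when each mask concentrates on
    # frames that match its caption; minimizing the negative pulls mask
    # centers and widths toward semantically relevant regions.
    weighted = (masks * sim).sum(dim=1) / masks.sum(dim=1).clamp_min(1e-8)
    return -weighted.mean()

# Usage: optimize centers/widths (and, in practice, the encoders) so that
# each event's mask drifts toward its caption-matching frames.
T, D, K = 64, 256, 3
frame_feats = torch.randn(T, D)
caption_feats = torch.randn(K, D)
centers = torch.rand(K, requires_grad=True)
widths = torch.full((K,), 0.1, requires_grad=True)
loss = similarity_alignment_loss(frame_feats, caption_feats, centers, widths)
loss.backward()
```

Under this reading, the LLM-generated synthetic captions would simply enlarge the set of caption embeddings contributing alignment signals, while the inter-mask mechanism keeps their gradients auxiliary to the ground-truth objective.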