Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos using only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, in which segment-level predictions are supervised by video-level labels. However, the training objective, which produces segment-level scores, is inconsistent with the testing target, which requires proposal-level scores, and this mismatch leads to suboptimal results. To address this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies candidate proposals in both the training and testing stages. The framework includes three key designs: 1) a surrounding contrastive feature extraction module that suppresses discriminative short proposals by considering surrounding contrastive information, 2) a proposal completeness evaluation module that inhibits low-quality proposals under the guidance of completeness pseudo labels, and 3) an instance-level rank consistency loss that achieves robust detection by leveraging the complementarity of the RGB and FLOW modalities. Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet, demonstrate the superior performance of our method.
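To make the proposal-level training signal concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that proposal logits are pooled into a video-level prediction by averaging the top-k logits per class, that the video-level loss is a binary cross-entropy, and that rank consistency between modalities can be approximated by a symmetric KL divergence over per-class softmax distributions across proposals. All function names, the pooling rule, the simple averaging used to fuse modalities, and the exact form of the consistency term are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def video_level_scores(proposal_logits, k):
    """Pool proposal logits of shape (num_proposals, num_classes) into
    video-level class probabilities. Top-k mean pooling is an assumption."""
    k = min(k, proposal_logits.shape[0])
    topk = np.sort(proposal_logits, axis=0)[-k:]     # k largest logits per class
    return 1.0 / (1.0 + np.exp(-topk.mean(axis=0)))  # sigmoid of the pooled logit


def mil_loss(video_probs, video_labels, eps=1e-8):
    """Binary cross-entropy between video-level predictions and multi-hot labels."""
    p = np.clip(video_probs, eps, 1.0 - eps)
    bce = video_labels * np.log(p) + (1.0 - video_labels) * np.log(1.0 - p)
    return float(-np.mean(bce))


def rank_consistency_loss(rgb_logits, flow_logits, eps=1e-8):
    """Hypothetical stand-in for a rank consistency term: symmetric KL between
    the per-class distributions each modality induces over the proposals."""
    p = softmax(rgb_logits, axis=0)
    q = softmax(flow_logits, axis=0)
    kl_pq = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=0)
    kl_qp = np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=0)
    return float(np.mean(kl_pq + kl_qp) / 2.0)


# Toy example: 6 candidate proposals, 3 action classes, random logits.
rng = np.random.default_rng(0)
rgb = rng.normal(size=(6, 3))       # proposal logits from the RGB stream
flow = rng.normal(size=(6, 3))      # proposal logits from the FLOW stream
labels = np.array([1.0, 0.0, 0.0])  # video-level multi-hot label

fused = (rgb + flow) / 2.0          # assumed fusion: plain average of the streams
loss = mil_loss(video_level_scores(fused, k=2), labels) + rank_consistency_loss(rgb, flow)
```

Because the same proposal scores are supervised during training and ranked at test time, a sketch like this avoids the segment-to-proposal mismatch the abstract attributes to S-MIL. The surrounding contrastive features and the completeness branch are omitted here.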