Video activity localization aims to understand the semantic content of long untrimmed videos and retrieve actions of interest. The retrieved action, with its start and end locations, can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. We then attempt to reverse this process by boundary denoising, which allows the localizer to predict activities with precise boundaries and leads to faster convergence. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, with far fewer predictions than other current methods.
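
The span-noising step described above can be illustrated with a minimal sketch. It is not DenoiseLoc's implementation: the abstract states only that spans are generated from the ground truth with a controlled noise scale, so the Gaussian noise, the scaling by span length, the clipping to the video extent, and every name below (`make_noisy_spans`, `noise_scale`, `num_copies`, `video_length`) are assumptions made for illustration.

```python
import random


def make_noisy_spans(gt_spans, noise_scale, num_copies, video_length, seed=None):
    """Return (noisy_span, gt_span) pairs for a denoising objective.

    Hypothetical helper: the noise model here is an assumption,
    not the procedure used in the paper.
    """
    rng = random.Random(seed)
    pairs = []
    for start, end in gt_spans:
        length = end - start
        for _ in range(num_copies):
            # Assumed noise model: Gaussian jitter on each boundary,
            # proportional to the span length and to noise_scale.
            s = start + rng.gauss(0.0, noise_scale) * length
            e = end + rng.gauss(0.0, noise_scale) * length
            # Clip to the video extent and keep start <= end.
            s = min(max(s, 0.0), video_length)
            e = min(max(e, 0.0), video_length)
            if s > e:
                s, e = e, s
            # The denoising target is the clean ground-truth span.
            pairs.append(((s, e), (start, end)))
    return pairs


pairs = make_noisy_spans([(10.0, 20.0), (35.0, 50.0)], noise_scale=0.2,
                         num_copies=3, video_length=60.0, seed=0)
```

In this sketch a decoder would receive each noisy span as a query and be trained to regress the paired ground-truth span; a larger `noise_scale` makes that task harder.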