Generic event boundary detection (GEBD) aims to pinpoint the event boundaries that humans naturally perceive, and it plays a crucial role in understanding long-form videos. The task remains challenging because generic boundaries are diverse, spanning different video appearances, objects, and actions. Existing methods usually detect all boundaries with the same protocol, regardless of their distinctive characteristics and detection difficulty, which results in suboptimal performance. Intuitively, a more reasonable approach is to detect boundaries adaptively, taking their individual properties into account. In light of this, we propose DyBDet, a novel dynamic pipeline for generic event boundary detection. By introducing a multi-exit network architecture, DyBDet automatically learns how to allocate subnets to different video snippets, enabling fine-grained detection of various boundaries. In addition, we propose a multi-order difference detector to ensure that generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that the dynamic strategy significantly benefits GEBD, yielding clear improvements in both performance and efficiency over the current state of the art.
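To make the two ideas named above more concrete, the following is a minimal, illustrative sketch and not the DyBDet implementation. It assumes that "multi-order difference" can be approximated by repeated finite differences over per-frame feature vectors, and that a multi-exit network can be approximated by a chain of stages with a fixed confidence threshold. The abstract states that subnet allocation is learned, so the fixed threshold here is a simplifying assumption. All names (`temporal_difference`, `multi_order_scores`, `early_exit`, `threshold`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# A stage maps a snippet representation to (new representation, boundary probability).
Stage = Callable[[object], Tuple[object, float]]


def temporal_difference(features: Sequence[Sequence[float]], order: int) -> List[Vector]:
    """Return the order-th finite difference of the features along the time axis."""
    diffs = [list(f) for f in features]
    for _ in range(order):
        # Each pass subtracts every frame from its successor, shortening the sequence by one.
        diffs = [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(diffs, diffs[1:])]
    return diffs


def multi_order_scores(features: Sequence[Sequence[float]], max_order: int = 2) -> Dict[int, List[float]]:
    """Map each order 1..max_order to the L1 magnitude of that difference per time step."""
    return {
        order: [sum(abs(x) for x in d) for d in temporal_difference(features, order)]
        for order in range(1, max_order + 1)
    }


def early_exit(snippet: object, stages: Sequence[Stage], threshold: float = 0.9) -> Tuple[float, int]:
    """Run stages in order and stop at the first confident one.

    Returns (boundary probability, number of stages used). The fixed
    threshold is an assumption made for illustration only.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    state, prob = snippet, 0.5
    for depth, stage in enumerate(stages, start=1):
        state, prob = stage(state)
        # Confident either way: clearly a boundary or clearly not a boundary.
        if max(prob, 1.0 - prob) >= threshold:
            return prob, depth
    return prob, len(stages)
```

In this sketch, a snippet whose first stage already yields a confident prediction leaves the network early, while an ambiguous snippet passes through more stages, which is the intuition behind gaining efficiency from a dynamic pipeline.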