Recent adaptive methods for efficient video recognition mostly follow a two-stage "preview-then-recognition" paradigm and have achieved strong results on multiple video benchmarks. However, this paradigm visits the raw frames twice during inference, first at coarse and then at fine granularity, and the two visits cannot be parallelized. Moreover, because the granularity differs between the stages, the spatiotemporal features captured in the first stage cannot be reused in the second, which hinders efficiency and computational optimization. To this end, inspired by human cognition, we propose a novel recognition paradigm, "View while Moving", for efficient recognition of long untrimmed videos. In contrast to the two-stage paradigm, ours accesses the raw frames only once: coarse-grained sampling and fine-grained recognition are combined into a single unified spatiotemporal modeling process. Moreover, we investigate the properties of semantic units in video and propose a hierarchical mechanism that efficiently captures and reasons about unit-level and video-level temporal semantics, respectively, in long untrimmed videos. Extensive experiments on both long untrimmed and short trimmed videos demonstrate that our approach outperforms state-of-the-art methods in both accuracy and efficiency, yielding new accuracy-efficiency trade-offs for video spatiotemporal modeling.