The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging because it requires reasoning over very long movie segments. In contrast, most existing video recognition models are designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block aggregates long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking multiple S4A blocks. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD), while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
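To make the two-stage structure of the S4A block concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes single-head attention without normalization or feed-forward layers, and it replaces the S4 layer with a simple per-channel (diagonal) causal linear recurrence, which is only a stand-in for the structured state-space parameterization. All function and parameter names (`intra_shot_attention`, `diagonal_ssm`, `s4a_block`, `init_params`, `trans4mer_sketch`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def intra_shot_attention(x, wq, wk, wv):
    # x: (num_shots, tokens_per_shot, dim).
    # Attention is computed independently per shot, so tokens only
    # attend to other tokens of the same shot (short-range cues).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ v


def diagonal_ssm(u, a, b, c):
    # Assumption: a diagonal linear recurrence stands in for S4.
    # u: (length, dim); a, b, c: (dim,) per-channel coefficients.
    # h_t = a * h_{t-1} + b * u_t,  y_t = c * h_t
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = a * h + b * u[t]
        out[t] = c * h
    return out


def s4a_block(x, p):
    num_shots, tokens, dim = x.shape
    # Step 1: self-attention within each shot (residual connection).
    x = x + intra_shot_attention(x, p["wq"], p["wk"], p["wv"])
    # Step 2: state-space scan over the flattened sequence of all
    # shots, which carries information across shot boundaries.
    flat = x.reshape(num_shots * tokens, dim)
    flat = flat + diagonal_ssm(flat, p["a"], p["b"], p["c"])
    return flat.reshape(num_shots, tokens, dim)


def init_params(dim, rng):
    # Hypothetical initialization, chosen only to keep the sketch stable.
    scale = 1.0 / np.sqrt(dim)
    return {
        "wq": rng.standard_normal((dim, dim)) * scale,
        "wk": rng.standard_normal((dim, dim)) * scale,
        "wv": rng.standard_normal((dim, dim)) * scale,
        "a": rng.uniform(0.5, 0.99, dim),  # decay in (0, 1)
        "b": np.ones(dim),
        "c": np.ones(dim),
    }


def trans4mer_sketch(x, blocks):
    # Stack several S4A blocks one after the other.
    for p in blocks:
        x = s4a_block(x, p)
    return x


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 8))  # 4 shots, 3 tokens each, dim 8
blocks = [init_params(8, rng) for _ in range(2)]
y = trans4mer_sketch(x, blocks)
```

In this sketch the attention cost grows with the square of the number of tokens per shot rather than the length of the whole video, while the recurrence is linear in the total sequence length. That is the intuition behind pairing the two operations, although the sketch itself makes no claim about the speed or memory figures reported above.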