Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.