Motion representation plays an important role in video understanding and has many applications, including action recognition and robotic or autonomous navigation, among others. Recently, transformer networks, through their self-attention mechanism, have proved effective in many applications. In this study, we introduce a new two-stream transformer video classifier that extracts spatio-temporal information from frame content and from optical flow representing motion. The proposed model computes self-attention features across the joint optical-flow and temporal-frame domain and models their relationships within the transformer encoder. The experimental results show that the proposed method achieves excellent classification results on three well-known video datasets of human activities.
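To make the described architecture concrete, the following is a minimal sketch of a two-stream transformer classifier that attends jointly over appearance and optical-flow tokens. It is not the authors' implementation: all module names, feature dimensions, and design choices (precomputed per-frame backbone features, learned positional and stream-type embeddings, mean pooling before the classification head) are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a two-stream
# transformer video classifier: one stream for RGB frame features,
# one for optical-flow features, with self-attention over the joint
# token sequence. All names and dimensions are assumptions.
import torch
import torch.nn as nn

class TwoStreamTransformerClassifier(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, n_heads=8,
                 n_layers=4, n_classes=101, max_tokens=64):
        super().__init__()
        # Separate linear embeddings for per-frame RGB features and
        # per-frame optical-flow features (assumed precomputed, e.g.
        # by a CNN backbone).
        self.rgb_embed = nn.Linear(feat_dim, d_model)
        self.flow_embed = nn.Linear(feat_dim, d_model)
        # Learned positional and stream-type embeddings so attention
        # can distinguish time steps and modalities.
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, d_model))
        self.stream_embed = nn.Parameter(torch.zeros(2, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, rgb_feats, flow_feats):
        # rgb_feats, flow_feats: (batch, T, feat_dim)
        T = rgb_feats.size(1)
        rgb = (self.rgb_embed(rgb_feats)
               + self.pos_embed[:, :T] + self.stream_embed[0])
        flow = (self.flow_embed(flow_feats)
                + self.pos_embed[:, :T] + self.stream_embed[1])
        # Concatenate tokens from both streams so self-attention spans
        # the joint appearance/motion domain.
        tokens = torch.cat([rgb, flow], dim=1)  # (batch, 2T, d_model)
        encoded = self.encoder(tokens)
        # Mean-pool over all tokens, then classify.
        return self.head(encoded.mean(dim=1))

# Usage: 16 frames of 2048-d backbone features per stream.
model = TwoStreamTransformerClassifier()
rgb = torch.randn(2, 16, 2048)
flow = torch.randn(2, 16, 2048)
logits = model(rgb, flow)  # (2, 101)
```

Concatenating the two token sequences before encoding is one simple way to let self-attention relate motion and appearance cues across time; other fusion points (per-layer cross-attention, late fusion of per-stream logits) are equally plausible readings of the abstract.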