In this paper, we introduce a novel Multiscale Video Transformer Network (MVTN) for dynamic hand gesture recognition. Multiscale features help handle the variations in hand size, pose, and shape that make hand gesture recognition challenging. The proposed model incorporates a multiscale feature hierarchy that captures diverse levels of detail and context within hand gestures, enhancing the model's representational ability. This hierarchy is obtained by extracting attention at different dimensions across the transformer stages: the initial stages model high-resolution features, while the later stages model low-resolution features. Our approach also leverages multimodal data, using depth maps, infrared data, and surface normals along with RGB images from the NVGesture and Briareo datasets. Experiments show that the proposed MVTN achieves state-of-the-art results with lower computational complexity and fewer parameters. The source code is available at https://github.com/mallikagarg/MVTN.
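The staged multiscale hierarchy described above can be illustrated with a minimal sketch: each stage applies self-attention at a wider embedding dimension, then downsamples the token grid so later stages operate on lower-resolution features. This is not the actual MVTN implementation; the attention here is a single random-projection head, the 16×16 token grid, the 2×2 subsampling, and the stage widths `(64, 128, 256, 512)` are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, d):
    # Single-head self-attention with random projections (illustrative only,
    # not the trained MVTN weights).
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((x.shape[-1], d)) * 0.02 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

def multiscale_stages(tokens, stage_dims=(64, 128, 256, 512)):
    """Run attention stage by stage; record (num_tokens, attn_dim) per stage.

    Early stages see many high-resolution tokens at a narrow dimension;
    later stages see few low-resolution tokens at a wider dimension.
    """
    h = w = 16  # assumed initial token grid per frame
    x = tokens
    shapes = []
    for d in stage_dims:
        x = attention(x, d)
        shapes.append((x.shape[0], d))
        # 2x2 subsampling halves the spatial resolution for the next stage
        x = x.reshape(h, w, d)[::2, ::2].reshape(-1, d)
        h, w = h // 2, w // 2
    return shapes
```

Running this on a 256-token input yields stage shapes `(256, 64)`, `(64, 128)`, `(16, 256)`, `(4, 512)`, showing the trade: token count shrinks as attention dimension grows, which is what keeps the later low-resolution stages cheap.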