Transformer models have achieved state-of-the-art results in many applications such as NLP and image classification, but their exploration in gesture recognition tasks is still limited. We therefore propose GestFormer, a novel architecture for dynamic hand gesture recognition. The design is motivated by resource efficiency: since transformers are computationally expensive and complex, we adopt a pooling-based token mixer, PoolFormer, which replaces quadratic attention with a non-parametric pooling layer. The proposed model also leverages the space-invariant features of the wavelet transform and selects multiscale features via multi-scale pooling. Further, a gated mechanism helps the model focus on fine details of the gesture along with contextual information. As a result, GestFormer outperforms the traditional transformer with fewer parameters when evaluated on the NVidia Dynamic Hand Gesture and Briareo dynamic hand gesture datasets. To demonstrate the efficacy of the proposed model, we experiment on single-modal as well as multimodal inputs, including infrared, surface normals, depth, optical flow, and color images. We also compare GestFormer in terms of resource efficiency and number of operations. The source code is available at https://github.com/mallikagarg/GestFormer.
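For illustration, the pooling-based token mixing that PoolFormer substitutes for self-attention can be sketched as follows. This is a minimal NumPy sketch of the general idea only, not the authors' implementation; the function name, shapes, and edge padding are assumptions:

```python
import numpy as np

def pool_token_mixer(x, pool_size=3):
    """PoolFormer-style token mixer: average pooling over the token
    dimension (with 'same'-style edge padding) minus the input.
    Non-parametric, unlike quadratic self-attention; the residual
    connection is added by the surrounding block.
    x: array of shape (num_tokens, channels). Hypothetical sketch."""
    pad = pool_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Sliding-window average over neighboring tokens
    pooled = np.stack([padded[i:i + pool_size].mean(axis=0)
                       for i in range(x.shape[0])])
    # Subtracting x mirrors PoolFormer's formulation, since the
    # block's residual branch re-adds the input afterwards
    return pooled - x
```

Because the mixer has no learned weights, its cost scales linearly with sequence length, which is the source of the efficiency gain over attention.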