Changes in facial expression, head movement, body movement, and hand gesture are salient cues in sign language recognition. However, most current continuous sign language recognition (CSLR) methods focus on static images in the video sequence during frame-level feature extraction and ignore the dynamic changes between images. In this paper, we propose a novel motor attention mechanism that captures the distortion of local motion regions during sign language expression and yields a dynamic representation of image changes. In addition, we apply self-distillation to frame-level feature extraction for continuous sign language for the first time: features of adjacent stages are self-distilled, with higher-order features serving as teachers that guide lower-order features, which improves feature expression without increasing computational resources. Together, the two components form our proposed holistic CSLR model based on the Motor Attention Mechanism and frame-level Self-Distillation (MAM-FSD), which improves the inference ability and robustness of the model. We conduct experiments on three publicly available datasets, and the results show that the proposed method effectively extracts sign language motion information from videos, improves CSLR accuracy, and reaches the state-of-the-art level.
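The two ideas summarized above can be illustrated with a minimal, dependency-free sketch. It is not the MAM-FSD implementation: the abstract does not give the exact formulation, so everything here is an assumption made for illustration. In particular, `motion_attention` approximates motion cues with a normalized absolute difference between adjacent frames, and `adjacent_stage_distillation_loss` assumes that all stage features have already been projected to a common dimension and compares them with a temperature-softened KL divergence, with the deeper stage acting as teacher. All function names and the `temperature` parameter are hypothetical.

```python
import math


def motion_attention(prev_frame, next_frame):
    """Per-position weights in [0, 1] from the absolute difference of two frames.

    Assumption: frames are flat lists of intensities of equal length.
    Static positions get weight 0; the fastest-changing position gets 1.
    """
    diffs = [abs(b - a) for a, b in zip(prev_frame, next_frame)]
    peak = max(diffs)
    if peak == 0:
        return [0.0] * len(diffs)  # no motion between the two frames
    return [d / peak for d in diffs]


def softmax(values, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    shift = max(values)
    exps = [math.exp((v - shift) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def adjacent_stage_distillation_loss(stage_features, temperature=2.0):
    """Sum of KL(teacher || student) over adjacent stage pairs.

    Assumption: stage_features is ordered from shallow to deep, and every
    stage has the same feature dimension. For each adjacent pair, the
    deeper stage is the teacher and the shallower stage is the student.
    """
    loss = 0.0
    for student, teacher in zip(stage_features[:-1], stage_features[1:]):
        p_teacher = softmax(teacher, temperature)
        p_student = softmax(student, temperature)
        loss += kl_divergence(p_teacher, p_student)
    return loss


if __name__ == "__main__":
    # Only the last position changes between the two toy frames.
    print(motion_attention([0.0, 0.5, 1.0], [0.0, 0.5, 0.0]))

    # Three toy stages, shallow to deep, already in a common dimension.
    stages = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [0.0, 0.0, 5.0]]
    print(adjacent_stage_distillation_loss(stages))
```

In a trainable version, the teacher features would typically be excluded from gradient computation so that only the lower-order features are pulled toward the higher-order ones. Because the teacher and student belong to the same network, such a loss adds no cost at inference time, which is consistent with the claim that feature expression improves without extra computational resources.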