Sign language recognition (SLR) has long been limited by insufficient model representation capacity. Although current pre-training approaches have alleviated this problem to some extent and achieved promising performance by applying various pretext tasks to sign pose data, they still suffer from two primary limitations: 1) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. 2) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masking strategy that reconstructs the motion residuals of masked frames, thereby explicitly exploring dynamic motion cues within sign pose sequences. Moreover, in SA, we endow our framework with global semantic awareness by aligning the embeddings of differently augmented samples of the input sequence in a shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks.
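To make the two components more concrete, the following is a minimal NumPy sketch of the quantities the abstract names: motion residuals of a pose sequence, a mask over frames, a reconstruction loss restricted to masked frames, a momentum update, and an alignment loss between two embeddings. The abstract does not specify any of these details, so everything here is an assumption made for illustration: the function names, the choice of frame-to-frame differences as residuals, sampling masked frames in proportion to motion magnitude, the mean squared error, and the cosine-based alignment loss are hypothetical stand-ins rather than the authors' implementation.

```python
import numpy as np


def motion_residuals(poses):
    """Frame-to-frame differences of a pose sequence (assumed residual definition).

    poses: array of shape (T, J, 2) with T frames, J keypoints, 2D coordinates.
    The first frame has no predecessor, so its residual is set to zero.
    """
    res = np.zeros_like(poses)
    res[1:] = poses[1:] - poses[:-1]
    return res


def motion_aware_mask(residuals, mask_ratio, rng):
    """Pick frames to mask, favouring frames with large motion (assumed heuristic)."""
    num_frames = residuals.shape[0]
    # Per-frame motion magnitude; the small constant keeps every probability positive.
    energy = np.linalg.norm(residuals.reshape(num_frames, -1), axis=1) + 1e-6
    probs = energy / energy.sum()
    n_mask = max(1, int(round(mask_ratio * num_frames)))
    idx = rng.choice(num_frames, size=n_mask, replace=False, p=probs)
    mask = np.zeros(num_frames, dtype=bool)
    mask[idx] = True
    return mask


def masked_residual_loss(pred, target, mask):
    """Mean squared error computed over masked frames only."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))


def ema_update(online_params, momentum_params, m=0.99):
    """Momentum update: momentum <- m * momentum + (1 - m) * online."""
    return {k: m * momentum_params[k] + (1.0 - m) * online_params[k]
            for k in online_params}


def alignment_loss(z_online, z_momentum):
    """One minus cosine similarity between two view embeddings (assumed loss form)."""
    a = z_online / np.linalg.norm(z_online)
    b = z_momentum / np.linalg.norm(z_momentum)
    return float(1.0 - a @ b)
```

In a training loop under these assumptions, the online encoder would see the masked sequence and predict residuals for the masked frames, while a second augmented view would pass through the momentum branch to produce the embedding used in `alignment_loss`; the two losses would then be combined with a weighting that the abstract does not state.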