MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition

Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: 1) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. 2) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masked strategy to reconstruct motion residuals of masked frames, thereby explicitly exploring dynamic motion cues among sign pose sequences. Moreover, in SA, we embed our framework with global semantic awareness by aligning the embeddings of different augmented samples from the input sequence in the shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks.
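The two components described above can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: motion residuals are taken as frame-to-frame pose differences, frames are masked with probability proportional to their motion magnitude, the reconstruction loss is a mean squared error on masked frames, and the momentum branch is an exponential moving average of the online weights with a cosine alignment loss. All function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def motion_residuals(poses):
    """Frame-to-frame differences of a pose sequence of shape (T, J, C).

    Assumption: the residual of the first frame is defined as zero.
    """
    residuals = np.zeros_like(poses)
    residuals[1:] = poses[1:] - poses[:-1]
    return residuals


def motion_aware_mask(residuals, mask_ratio, rng):
    """Pick frames to mask, favouring frames with larger motion (assumption)."""
    num_frames = residuals.shape[0]
    # Per-frame motion energy; the small constant keeps every probability > 0.
    energy = np.linalg.norm(residuals.reshape(num_frames, -1), axis=1) + 1e-6
    probs = energy / energy.sum()
    num_masked = max(1, int(round(mask_ratio * num_frames)))
    chosen = rng.choice(num_frames, size=num_masked, replace=False, p=probs)
    mask = np.zeros(num_frames, dtype=bool)
    mask[chosen] = True
    return mask


def masked_residual_loss(predicted, target, mask):
    """Mean squared error computed on the masked frames only."""
    return float(np.mean((predicted[mask] - target[mask]) ** 2))


def momentum_update(online_params, momentum_params, m=0.99):
    """Exponential moving average of the online weights (momentum branch)."""
    return {
        name: m * momentum_params[name] + (1.0 - m) * online_params[name]
        for name in momentum_params
    }


def alignment_loss(z_online, z_momentum):
    """One minus the cosine similarity between two sequence-level embeddings."""
    z_online = z_online / np.linalg.norm(z_online)
    z_momentum = z_momentum / np.linalg.norm(z_momentum)
    return float(1.0 - z_online @ z_momentum)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    poses = rng.normal(size=(16, 21, 2))  # 16 frames, 21 keypoints, (x, y)
    target = motion_residuals(poses)
    mask = motion_aware_mask(target, mask_ratio=0.5, rng=rng)
    # A stand-in "prediction": a real model would decode it from visible frames.
    prediction = np.zeros_like(target)
    print("masked frames:", int(mask.sum()))
    print("reconstruction loss:", masked_residual_loss(prediction, target, mask))
```

In a full pre-training setup, the reconstruction loss and the alignment loss would presumably be combined into one objective; how they are weighted is not stated in the abstract.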


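Related content

Autoencoder

An autoencoder is an artificial neural network used to learn efficient data encodings in an unsupervised manner. Its aim is to learn a representation (encoding) of a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Alongside the reduction side, a reconstruction side is learned, in which the autoencoder tries to generate, from the reduced encoding, a representation as close as possible to its original input, hence the name. Several variants of the basic model exist, each designed to force the learned representation of the input to have useful properties. Autoencoders have been applied effectively to many problems, from face recognition to learning the semantic meaning of words.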
