Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, because training samples are limited, a connectionist temporal classification (CTC) loss alone may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance CSLR backbones. The first task enhances the visual module, which is sensitive to insufficient training, from the perspective of consistency. Specifically, since the information in sign languages is mainly conveyed by signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to force the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that the output features of the visual and sequential modules both represent the same sentence, we impose a sentence embedding consistency constraint between the two modules to better exploit the backbone's power and enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. Third, to make the model more robust in the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks: PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
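
To make the two consistency constraints concrete, the following is a minimal sketch of how such auxiliary terms could be added to a CTC loss. The abstract does not specify the form of either loss, so everything here is an assumption for illustration: mean pooling to obtain sentence embeddings, cosine distance for sentence embedding consistency, mean squared error against a keypoint heatmap for spatial attention consistency, and the hypothetical weights `w_attn` and `w_sent`. The signer removal module is omitted.

```python
import math


def mean_pool(frames):
    """Average per-frame feature vectors into one sentence embedding (assumed pooling)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the two embeddings point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def attention_consistency(attn, heatmap):
    """Mean squared error between an attention map and a keypoint heatmap (same shape)."""
    diffs = [(a - h) ** 2
             for row_a, row_h in zip(attn, heatmap)
             for a, h in zip(row_a, row_h)]
    return sum(diffs) / len(diffs)


def total_loss(ctc, attn, heatmap, visual_feats, seq_feats, w_attn=1.0, w_sent=1.0):
    """CTC loss plus the two weighted consistency terms (weights are hypothetical)."""
    sent = cosine_distance(mean_pool(visual_feats), mean_pool(seq_feats))
    return ctc + w_attn * attention_consistency(attn, heatmap) + w_sent * sent
```

In this sketch the attention term is small when the visual module attends to the regions marked by the keypoint heatmap, and the sentence term is small when the pooled visual and sequential features agree. The released code at the URL above is the reference for the actual formulation.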
