SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning

Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. SLR is weakly supervised: glosses are annotated for entire videos rather than for individual segments, which makes it hard to identify the gloss that corresponds to a given video segment. Recent studies indicate that the main bottleneck in SLR is insufficient training caused by the limited availability of large-scale datasets. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language models. SignVTCL integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone, thereby yielding more robust visual representations. Furthermore, SignVTCL contains a visual-textual alignment approach that combines gloss-level and sentence-level alignment to ensure precise correspondence between visual features and glosses, at the level of both individual glosses and whole sentences. Experiments on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods.
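The abstract does not spell out the contrastive objective, so the following is only an illustrative sketch of what sentence-level visual-textual alignment could look like, assuming a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings. The function names, the temperature value of 0.07, and the use of plain Python lists are all assumptions made for illustration; they are not taken from SignVTCL.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(visual, textual):
    """Entry [i][j] is the cosine similarity of visual[i] and textual[j]."""
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in textual]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(visual, textual, temperature=0.07):
    """Hypothetical sentence-level alignment loss (assumed, not the paper's).

    visual[i] and textual[i] are embeddings of the same sample; every other
    pairing in the batch is treated as a negative.
    """
    sim = cosine_similarity_matrix(visual, textual)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-visual direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    video_emb = [[1.0, 0.0], [0.0, 1.0]]
    text_emb = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(video_emb, text_emb))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behaviour a contrastive alignment term is meant to have. A gloss-level variant would apply the same idea to per-gloss visual features instead of pooled sentence features.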

