This work addresses continuous sign language recognition (CSLR), a weakly supervised task in which continuous signs are recognized from videos without any prior knowledge of the temporal boundaries between consecutive signs. Data scarcity heavily impedes the progress of CSLR. Existing approaches typically train CSLR models on a monolingual corpus, which is orders of magnitude smaller than the corpora available for speech recognition. In this work, we explore the feasibility of utilizing multilingual sign language corpora to facilitate monolingual CSLR. Our work builds on the observation of cross-lingual signs, which originate from different sign languages but have similar visual signals (e.g., hand shape and motion). The underlying idea of our approach is to identify the cross-lingual signs in one sign language and leverage them as auxiliary training data to improve the recognition capability of another. To this end, we first build two sign language dictionaries containing the isolated signs that appear in the two datasets. We then identify the sign-to-sign mappings between the two sign languages via a well-optimized isolated sign language recognition model. Finally, we train a CSLR model on the combination of the target data with original labels and the auxiliary data with mapped labels. Experimentally, our approach achieves state-of-the-art performance on two widely used CSLR datasets: Phoenix-2014 and Phoenix-2014T.
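The mapping and relabeling steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how mappings are scored or filtered, so the averaging rule, the `min_conf` threshold, the choice to drop unmapped glosses, and the names `map_signs` and `relabel` are all assumptions made for illustration. The sketch assumes an isolated sign recognition model trained on the target-language dictionary has already produced class probabilities for the auxiliary-language isolated signs.

```python
import numpy as np


def map_signs(probs, aux_labels, num_aux_classes, min_conf=0.5):
    """Hypothetical sign-to-sign mapping rule (an assumption, not from the paper).

    probs:      (N, C_target) class probabilities that a target-language
                isolated sign classifier assigns to N auxiliary isolated signs.
    aux_labels: (N,) auxiliary-language class id of each sample.
    Returns a dict {aux_class: target_class} of confident mappings only.
    """
    mapping = {}
    for aux_class in range(num_aux_classes):
        mask = aux_labels == aux_class
        if not mask.any():
            continue  # no samples of this auxiliary sign in the dictionary
        # Average the predictions over all samples of this auxiliary sign.
        mean_prob = probs[mask].mean(axis=0)
        target_class = int(mean_prob.argmax())
        # Keep the mapping only if the classifier is confident enough.
        if mean_prob[target_class] >= min_conf:
            mapping[aux_class] = target_class
    return mapping


def relabel(gloss_seq, mapping):
    """Rewrite an auxiliary gloss sequence with mapped target labels.

    Glosses without a mapping are dropped here; this is an assumed policy.
    """
    return [mapping[g] for g in gloss_seq if g in mapping]


# Toy data: 4 auxiliary isolated signs, 3 target classes.
probs = np.array([
    [0.9, 0.1, 0.0],   # auxiliary class 0
    [0.7, 0.2, 0.1],   # auxiliary class 0
    [0.3, 0.4, 0.3],   # auxiliary class 1 (ambiguous)
    [0.1, 0.8, 0.1],   # auxiliary class 2
])
aux_labels = np.array([0, 0, 1, 2])

mapping = map_signs(probs, aux_labels, num_aux_classes=3)
mapped_sequence = relabel([0, 1, 2, 2], mapping)
```

In this toy run, auxiliary class 1 is left unmapped because no target class reaches the threshold, so it is removed from the relabeled sequence. The relabeled auxiliary sequences would then be pooled with the target data, which keeps its original labels, to train the CSLR model.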