In this work, our goals are twofold: large-vocabulary continuous sign language recognition (CSLR) and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new, manually collected dataset annotations. These provide continuous sign-level annotations for six hours of test videos and will be made publicly available. We demonstrate that, with a careful choice of loss functions, training the model jointly for the CSLR and retrieval tasks is mutually beneficial: retrieval improves CSLR performance by providing context, while CSLR improves retrieval through more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.