Sign Language Translation (SLT) is a challenging task because of its cross-domain nature: it translates a visual-gestural language into text. Many previous methods rely on an intermediate representation, i.e., gloss sequences, which turns SLT into a two-stage pipeline of sign language recognition (SLR) followed by translation of the recognized glosses into text. However, the scarcity of gloss-annotated sign language data, combined with the information bottleneck in the mid-level gloss representation, has hindered further progress on SLT. To address this challenge, we propose a novel Gloss-Free SLT method based on Visual-Language Pretraining (GFSLT-VLP), which improves SLT by inheriting language-oriented prior knowledge from pre-trained models, without any gloss annotation. Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pretext tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end encoder-decoder architecture that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage. Together, these designs yield a robust sign language representation and significantly improve gloss-free sign language translation. In particular, our method improves the BLEU-4 score by more than 5 points on the PHOENIX14T dataset and by more than 3 points on the CSL-Daily dataset compared to state-of-the-art gloss-free SLT methods. Furthermore, our approach achieves competitive results on the PHOENIX14T dataset when compared with most gloss-based methods. Our code is available at https://github.com/zhoubenjia/GFSLT-VLP.
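To make the contrastive part of stage (i) concrete, the following is a minimal sketch of a standard CLIP-style symmetric contrastive loss over a batch of paired video and sentence embeddings. It is an illustration under stated assumptions, not the loss used in GFSLT-VLP: the function names, the temperature value of 0.07, and the use of plain Python lists in place of encoder outputs are all hypothetical, and the exact objective in the released code may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(visual, text, temperature=0.07):
    """Symmetric contrastive loss for a batch of (visual, text) embedding pairs.

    visual[i] and text[i] are assumed to come from the same sign video and
    its spoken-language sentence; all other pairings in the batch act as
    negatives. The temperature is an illustrative default, not a value
    taken from the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)

    # logits[i][j]: scaled cosine similarity between video i and sentence j.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-video direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2v = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned:", clip_style_loss(videos, aligned_text))
    print("swapped:", clip_style_loss(videos, swapped_text))
```

In this toy setup the aligned pairing yields a much smaller loss than the swapped pairing, which is the behaviour the pretraining stage relies on to pull matching visual and textual representations together.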