End-to-end sign language translation (SLT) aims to convert sign language videos directly into spoken language text, without intermediate representations. The task is challenging because of the modality gap between sign videos and text and the scarcity of labeled data. To tackle these challenges, we propose a novel Cross-modality Data Augmentation (XmDA) framework that transfers powerful gloss-to-text translation capabilities to end-to-end sign language translation (i.e., video-to-text) by exploiting pseudo gloss-text pairs from a sign gloss translation model. Specifically, XmDA consists of two key components: cross-modality mix-up and cross-modality knowledge distillation. The former explicitly encourages alignment between sign video features and gloss embeddings to bridge the modality gap. The latter uses the generation knowledge of gloss-to-text teacher models to guide spoken language text generation. Experimental results on two widely used SLT datasets, PHOENIX-2014T and CSL-Daily, demonstrate that the proposed XmDA framework significantly and consistently outperforms the baseline models. Extensive analyses confirm our claim that XmDA enhances spoken language text generation by reducing the representation distance between videos and texts and by improving the handling of low-frequency words and long sentences.
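To make the two components more concrete, the following is a minimal sketch rather than the actual XmDA implementation. It assumes that video features have already been segmented so that each segment aligns with one gloss, that mix-up is done by randomly substituting whole segments with gloss embeddings, and that distillation is done by pairing each video with teacher-generated text. The names `cross_modality_mixup`, `build_distillation_pairs`, and `mix_ratio` are hypothetical and do not come from the abstract.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def cross_modality_mixup(
    video_segments: Sequence[Sequence[Vector]],
    gloss_embeddings: Sequence[Vector],
    mix_ratio: float,
    rng: random.Random,
) -> List[Vector]:
    """Build one mixed-modality sequence (illustrative assumption).

    video_segments[i] holds the video feature frames assumed to align with
    gloss i. With probability mix_ratio the segment is replaced by the single
    gloss embedding; otherwise the video frames are kept.
    """
    if len(video_segments) != len(gloss_embeddings):
        raise ValueError("need exactly one video segment per gloss")
    mixed: List[Vector] = []
    for segment, gloss_vec in zip(video_segments, gloss_embeddings):
        if rng.random() < mix_ratio:
            mixed.append(list(gloss_vec))  # take the gloss modality
        else:
            mixed.extend(list(frame) for frame in segment)  # keep video frames
    return mixed


def build_distillation_pairs(
    samples: Sequence[Tuple[str, str]],
    teachers: Sequence[Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Pair each video id with text produced by every gloss-to-text teacher.

    samples are (video_id, gloss_sequence) tuples; each teacher is a stand-in
    for a trained gloss-to-text model.
    """
    return [
        (video_id, teacher(gloss))
        for video_id, gloss in samples
        for teacher in teachers
    ]


# Toy usage: two glosses, the first aligned with two frames, the second with one.
segments = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
embeddings = [[1.0, 1.0], [2.0, 2.0]]
mixed_sequence = cross_modality_mixup(segments, embeddings, 0.5, random.Random(0))
pseudo_pairs = build_distillation_pairs([("video_001", "HELLO WORLD")], [str.lower])
```

In this sketch, a `mix_ratio` of 0 yields the pure video sequence and a value of 1 yields the pure gloss sequence, so intermediate values expose the translation model to inputs that interleave both modalities.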