State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end-to-end manner or through an intermediate step. Unfortunately, gloss-labelled sign language data is usually not available at scale and, when available, gloss annotations differ widely from dataset to dataset. We present a novel approach that uses sentence embeddings of the target sentences at training time to take the role of glosses. This new form of supervision requires no manual annotation but is instead learned from raw textual data. As our approach readily extends to multilingual settings, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with both mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting a new state of the art for datasets where glosses are not available and where no additional SLT datasets are used for pretraining, and narrowing the gap between gloss-free and gloss-dependent systems.
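The core idea — using frozen sentence embeddings of the target sentence as an auxiliary training signal in place of glosses — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy embedding lookup, the mean-pooling step, and the cosine-distance loss are all assumptions standing in for the actual sentence-embedding model and training objective.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 when vectors align, up to 2 when opposed.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return 1.0 - dot / (norm_u * norm_v)

# Stand-in for a frozen sentence-embedding model: maps a target sentence
# to a fixed vector learned from raw text (no manual annotation needed).
sentence_embedding = {
    "morgen regnet es im norden": [0.9, 0.1, 0.2],
}

def auxiliary_loss(video_frame_features, target_sentence):
    # Pool per-frame sign-video features into a single vector (mean pooling
    # here; the real system's pooling/architecture may differ).
    pooled = [sum(col) / len(col) for col in zip(*video_frame_features)]
    # Supervise the pooled representation to match the frozen sentence
    # embedding of the target sentence, playing the role a gloss would.
    return cosine_distance(pooled, sentence_embedding[target_sentence])

frames = [[0.8, 0.2, 0.1], [1.0, 0.0, 0.3]]  # two toy frame feature vectors
loss = auxiliary_loss(frames, "morgen regnet es im norden")
```

In a full system this loss would be added to the usual translation loss, so the video encoder learns sentence-level semantics without any gloss annotation.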