Despite recent successes with neural models for sign language translation (SLT), translation quality still lags behind that for spoken languages because of data scarcity and the modality gap between sign video and text. To address both problems, we investigate strategies for cross-modality representation sharing for SLT. We propose SLTUNET, a simple unified neural model designed to support multiple SLT-related tasks jointly, such as sign-to-gloss, gloss-to-text, and sign-to-text translation. Jointly modeling different tasks endows SLTUNET with the capability to explore the cross-task relatedness that could help narrow the modality gap. In addition, this allows us to leverage knowledge from external resources, such as the abundant parallel data used for spoken-language machine translation (MT). We show in experiments that SLTUNET achieves competitive and even state-of-the-art performance on PHOENIX-2014T and CSL-Daily when augmented with MT data and equipped with a set of optimization techniques. We further use the DGS Corpus for end-to-end SLT for the first time. It covers broader domains with a significantly larger vocabulary, which makes it more challenging and, in our view, allows for a more realistic assessment of the current state of SLT than the former two datasets. Still, SLTUNET obtains improved results on the DGS Corpus. Code is available at https://github.com/bzhangGo/sltunet.