This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a truly end-to-end architecture expected to generalize better to new datasets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test set dropped to $0.11\pm0.06$ BLEU points.
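To make the architectural idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of an end-to-end SLT model in which spatio-temporal features and translation are learned jointly: a small 3D CNN embeds the video clip, and a standard Transformer maps the resulting frame-level features to target-language token logits. All layer names, sizes, and hyperparameters are assumptions for illustration only.

```python
# Hedged sketch of a single-model, end-to-end sign language translation system.
# The 3D convolutions operate over (time, height, width), so temporal cues are
# learned jointly with the translation objective rather than extracted frame by
# frame beforehand. Sizes and layer counts are illustrative, not the paper's.
import torch
import torch.nn as nn


class EndToEndSLT(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256):
        super().__init__()
        # Spatio-temporal front end: learns motion-aware features from raw video.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool away space
        )
        self.proj = nn.Linear(64, d_model)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        # Generic seq2seq back end, as in textual MT.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=3,
            num_decoder_layers=3, batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video, tgt_tokens):
        # video: (batch, channels=3, frames, height, width)
        feats = self.cnn3d(video)               # (B, 64, T, 1, 1)
        feats = feats.flatten(3).squeeze(-1)    # (B, 64, T)
        src = self.proj(feats.transpose(1, 2))  # (B, T, d_model)
        tgt = self.tgt_embed(tgt_tokens)        # (B, L, d_model)
        dec = self.transformer(src, tgt)
        return self.out(dec)                    # (B, L, vocab_size)


if __name__ == "__main__":
    model = EndToEndSLT()
    video = torch.randn(2, 3, 16, 64, 64)       # toy batch of 16-frame clips
    tgt = torch.randint(0, 8000, (2, 10))
    logits = model(video, tgt)
    print(logits.shape)                         # torch.Size([2, 10, 8000])
```

Because the video front end and the translation back end share one loss, gradients from the translation objective shape the spatio-temporal features directly, which is the property the abstract refers to as a truly end-to-end architecture.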