Most sign language translation (SLT) methods to date rely on gloss annotations for additional supervision, but gloss annotations are difficult to acquire. To address this problem, we first analyze existing models to determine how gloss annotations make SLT easier. We find that gloss provides the model with two kinds of information: 1) it helps the model implicitly learn the location of semantic boundaries in continuous sign language videos, and 2) it helps the model understand the sign language video globally. We then propose \emph{gloss attention}, which keeps the model's attention within local video segments that share the same semantics, just as gloss helps existing models do. Furthermore, we transfer knowledge of sentence-to-sentence similarity from a natural language model to our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level. Experimental results on multiple large-scale sign language datasets show that the proposed GASLT model significantly outperforms existing methods. Our code is available at \url{https://github.com/YinAoXiong/GASLT}.
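The idea of keeping attention within a local video segment can be illustrated with a minimal sketch. This is not the GASLT implementation: it assumes a fixed symmetric window around each frame as a stand-in for "segments that share the same semantics", and the function name `local_window_attention` and the `window` parameter are hypothetical names chosen for illustration.

```python
import numpy as np

def local_window_attention(x, window=2):
    """Self-attention in which frame t attends only to frames within
    `window` positions of t (an assumed simplification, not gloss attention itself)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (T, T) scaled dot-product scores
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)      # block attention outside the window
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# Toy input: 8 "frames" with 16-dimensional features (random, for illustration only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
out, w = local_window_attention(frames, window=2)
```

Each row of `w` is a probability distribution that is zero outside the window, so every output frame is a mixture of its temporal neighbors only. A fixed window is the simplest possible choice here; it does not adapt to where semantic boundaries actually fall.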