In this paper, we tackle the problem of sign language translation (SLT) without gloss annotations. Although intermediate representations such as glosses have proven effective, gloss annotations are hard to acquire, especially in large quantities. This limits the domain coverage of translation datasets and thus hinders real-world applications. To mitigate this problem, we design the Gloss-Free End-to-end sign language translation framework (GloFE). Our method improves the performance of SLT in the gloss-free setting by exploiting the semantics shared by signs and their corresponding spoken-language translations. Common concepts are extracted from the text and used as a weak form of intermediate representation. The global embeddings of these concepts are used as queries for cross-attention to locate the corresponding information within the learned visual features. In a contrastive manner, we encourage similarity between the query results of samples that contain a given concept and reduce their similarity to the query results of samples that do not. We obtain state-of-the-art results on large-scale datasets, including OpenASL and How2Sign. The code and model will be available at https://github.com/HenryLittle/GloFE.
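The concept-query mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attend`, `concept_contrastive_loss`), the single-head attention without learned projections, the cosine similarity, and the InfoNCE-style loss with a `temperature` parameter are all assumptions made for illustration. The abstract only states that a concept embedding queries the visual features and that the query results are trained contrastively.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, frames):
    """Single-head cross-attention: one concept query over frame features.

    Assumption: no learned key/value projections; frames act as both.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frames])
    return [sum(w * f[i] for w, f in zip(weights, frames))
            for i in range(len(query))]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def concept_contrastive_loss(query, batch_frames, has_concept, temperature=0.1):
    """InfoNCE-style loss for one concept (an assumed loss form).

    Query results of samples that contain the concept are pulled together;
    results of samples without it serve as negatives.
    """
    results = [cross_attend(query, frames) for frames in batch_frames]
    n = len(results)
    losses = []
    for i in range(n):
        if not has_concept[i]:
            continue
        positives = [j for j in range(n) if has_concept[j] and j != i]
        if not positives:
            continue
        logits = {j: cosine(results[i], results[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        losses.extend(log_denom - logits[j] for j in positives)
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: 2-D features, three videos with two frames each (hypothetical).
query = [4.0, 0.0]                      # embedding of one concept
batch = [
    [[1.0, 0.2], [0.0, 1.0]],           # contains a concept-like frame
    [[0.0, 1.0], [1.0, -0.2]],          # contains a concept-like frame
    [[0.0, 1.0], [-1.0, 0.0]],          # does not
]
loss_matched = concept_contrastive_loss(query, batch, [True, True, False])
loss_mismatched = concept_contrastive_loss(query, batch, [True, False, True])
print(loss_matched, loss_mismatched)
```

On this toy batch, the loss is lower when the concept labels agree with the visual content than when they are swapped, which is the behavior the contrastive objective is meant to reward.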