Automatic sign language translation requires integrating computer vision and natural language processing to bridge the communication gap between sign and spoken languages. However, the lack of large-scale training data for sign language translation means we need to leverage resources from spoken language. We introduce Sign2GPT, a novel framework that utilizes large-scale pretrained vision and language models via lightweight adapters for gloss-free sign language translation. The lightweight adapters are crucial for sign language translation because of the constraints imposed by limited dataset sizes and the computational cost of training on long sign videos. We also propose a novel pretraining strategy that directs our encoder to learn sign representations from automatically extracted pseudo-glosses without requiring gloss order information or annotations. We evaluate our approach on two public benchmark sign language translation datasets, RWTH-PHOENIX-Weather 2014T and CSL-Daily, and improve on state-of-the-art gloss-free translation performance by a significant margin.