We address the task of American Sign Language (ASL) fingerspelling translation from videos in the wild. Building on recent advances in hand pose estimation, we propose a novel architecture based on a transformer encoder-decoder model that enables contextual word translation. The translation model is augmented by a novel loss term for accurately predicting the length of the fingerspelled word, which benefits both training and inference. We also propose a novel two-stage inference approach that re-ranks the hypotheses using the language-modeling capabilities of the decoder. Through extensive experiments, we demonstrate that the proposed method outperforms state-of-the-art models on ChicagoFSWild and ChicagoFSWild+, achieving more than 10% relative improvement in performance. Our findings highlight the effectiveness of the approach and its potential to advance fingerspelling recognition in sign language translation. Code is available at https://github.com/pooyafayyaz/Fingerspelling-PoseNet.
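The two-stage inference idea can be illustrated with a minimal sketch. It is not the released implementation: the function `rerank`, the weights `lm_weight` and `length_weight`, the linear combination of scores, and the toy scorer `toy_lm_log_prob` are all assumptions made for illustration. The sketch assumes a first pass has already produced candidate words with scores, and that a second pass combines each score with a language-model score and a penalty for deviating from the predicted word length.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (candidate word, first-pass log score)


def rerank(
    hypotheses: List[Hypothesis],
    lm_log_prob: Callable[[str], float],
    predicted_length: int,
    lm_weight: float = 0.5,       # hypothetical weight, not from the paper
    length_weight: float = 0.1,   # hypothetical weight, not from the paper
) -> List[Hypothesis]:
    """Sort hypotheses by a combined score, best first.

    The linear combination below is an assumed scoring rule chosen for
    illustration; the paper's actual rule may differ.
    """
    def combined(item: Hypothesis) -> float:
        text, first_pass_score = item
        # Penalize candidates whose length differs from the predicted length.
        length_penalty = abs(len(text) - predicted_length)
        return (
            first_pass_score
            + lm_weight * lm_log_prob(text)
            - length_weight * length_penalty
        )

    return sorted(hypotheses, key=combined, reverse=True)


def toy_lm_log_prob(text: str) -> float:
    """Stand-in for a decoder used as a language model (hypothetical)."""
    known_words = {"hello"}
    return 0.0 if text in known_words else -5.0


candidates = [("helo", -1.0), ("hello", -1.2), ("hellp", -1.1)]
best_word, _ = rerank(candidates, toy_lm_log_prob, predicted_length=5)[0]
print(best_word)
```

In this toy setup the first pass slightly prefers a misspelled candidate, and the second pass promotes the candidate that the stand-in language model favors and that matches the predicted length.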