Sign language recognition and translation proceeds in two stages: a recognition module first generates glosses from sign language videos, and a translation module then translates the glosses into spoken-language sentences. Most existing work focuses on the recognition step and pays less attention to translation. In this work, we propose a task-aware instruction network, TIN-SLT, for sign language translation, which introduces an instruction module and a learning-based feature fusion strategy into a Transformer network. In this way, the language ability of the pre-trained model can be exploited to further boost translation performance. Moreover, by exploring the representation spaces of sign language glosses and the target spoken language, we propose a multi-level data augmentation scheme that adjusts the data distribution of the training set. We conduct extensive experiments on two challenging benchmark datasets, PHOENIX-2014-T and ASLG-PC12, on which our method outperforms the previous best results by 1.65 and 1.42 BLEU-4 points, respectively. Our code is available at https://github.com/yongcaoplus/TIN-SLT.
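The abstract does not state what form the learning-based feature fusion takes. As a minimal sketch of one common possibility, the snippet below blends two equally shaped feature sequences (for example, Transformer encoder states and features from a pre-trained language model) with a single learnable scalar gate. The names `fuse_features` and `gate_logit` are hypothetical and are not taken from the TIN-SLT code; the actual method may differ.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_features(encoder_feats: Matrix, pretrained_feats: Matrix,
                  gate_logit: float) -> Matrix:
    """Blend two feature sequences with one scalar gate (illustrative assumption).

    alpha = sigmoid(gate_logit) weights the encoder features, and
    (1 - alpha) weights the pre-trained model features. In training,
    gate_logit would be a parameter updated by gradient descent.
    """
    if len(encoder_feats) != len(pretrained_feats):
        raise ValueError("feature sequences must have the same length")
    alpha = sigmoid(gate_logit)
    fused = []
    for enc_row, pre_row in zip(encoder_feats, pretrained_feats):
        if len(enc_row) != len(pre_row):
            raise ValueError("feature vectors must have the same dimension")
        fused.append([alpha * e + (1.0 - alpha) * p
                      for e, p in zip(enc_row, pre_row)])
    return fused
```

With `gate_logit = 0.0` the gate is 0.5 and the two sources are averaged; a large positive value makes the output follow the encoder features almost exclusively.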