In this paper, we devise a mechanism for adding multi-modal information to an existing pipeline for continuous sign language recognition and translation. Specifically, we combine optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we use is very lightweight and does not require including a separate feature extractor for the new modality in an end-to-end manner. We apply the changes to both sign language recognition and translation, improving the results in each case. We evaluate performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and on the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduces the word error rate (WER) by 0.9, and on the translation task, it increases most of the BLEU scores by ~0.6 on the test set.
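For readers unfamiliar with the recognition metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance between the reference and the predicted gloss sequence, divided by the reference length and reported as a percentage. It illustrates the standard definition only and is not the evaluation script used in this work; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name); tokens are split on whitespace.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the first i-1 reference tokens
    # and the first j hypothesis tokens (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Under this definition, a lower value is better, so a reduction in WER corresponds to fewer word-level errors relative to the reference length.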