In this paper, we devise a mechanism for adding multi-modal information to an existing pipeline for continuous sign language recognition and translation. In our procedure, we combine optical flow information with RGB images to enrich the features with movement-related information. This work studies the feasibility of such modality inclusion using a cross-modal encoder. The plugin we use is very lightweight and does not require a separate feature extractor for the new modality to be included in an end-to-end manner. We apply the changes to both sign language recognition and translation, improving the result in each case. We evaluate the performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and on the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduces the word error rate (WER) by 0.9, and on the translation task, it increases most of the BLEU scores by approximately 0.6 on the test set.
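For readers less familiar with the recognition metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the predicted and reference gloss sequences, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation script used in this work; the function name `word_error_rate`, the choice to report a percentage, and the example glosses are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (assumed convention for this sketch).

    WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical gloss sequences, for illustration only.
reference_glosses = "TOMORROW RAIN NORTH STRONG"
predicted_glosses = "TOMORROW SNOW NORTH"
print(word_error_rate(reference_glosses, predicted_glosses))
```

In this example the prediction contains one substitution and one deletion against four reference glosses, so the function returns 50.0. Under this convention, lower values indicate better recognition.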