In this paper, we devise a mechanism for adding multi-modal information to an existing pipeline for continuous sign language recognition and translation. Specifically, we combine optical flow with RGB images to enrich the features with movement-related information, and we study the feasibility of this modality inclusion using a cross-modal encoder. The plugin we use is very lightweight and does not need a separate feature extractor for the new modality in an end-to-end manner. We apply the changes to both sign language recognition and translation, improving the results in each case. We evaluate performance on the RWTH-PHOENIX-2014 dataset for sign language recognition and on the RWTH-PHOENIX-2014T dataset for translation. On the recognition task, our approach reduces the word error rate (WER) by 0.9, and on the translation task, it increases most of the BLEU scores by ~0.6 on the test set.
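For readers unfamiliar with the recognition metric, the following is a minimal sketch of how a sentence-level word error rate can be computed as a token-level edit distance divided by the reference length. It is not the evaluation script used in this work; the function name `word_error_rate` and the example gloss sequences are assumptions made purely for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are lists of tokens (e.g. sign glosses). This is a
    generic Levenshtein-distance sketch, not the paper's official scorer.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # i deletions turn i reference tokens into an empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical gloss sequences, for illustration only.
reference = "MORGEN REGEN NORD WIND".split()
hypothesis = "MORGEN REGEN WIND".split()
wer = word_error_rate(reference, hypothesis)  # one deletion over four reference tokens: 0.25
```

Corpus-level WER is usually obtained by summing the edit operations over all sentences and dividing by the total number of reference tokens, rather than by averaging per-sentence values.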