Towards the extraction of robust sign embeddings for low resource sign language recognition

Isolated Sign Language Recognition (SLR) has mostly been applied to datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we are met with challenging visual conditions, coarticulated signing, small datasets, and the need for signer-independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and challenging poses in sign language, they lack robustness on sign language data, and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit their usefulness for SLR. From the existing literature, it is also not clear which pose estimator, if any, performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when fine-tuning only the classifier layer of an SLR model on a target sign language. We furthermore achieve better performance with fine-tuned transferred embeddings than with models trained only on the target sign language. The embeddings can also be learned in a multilingual fashion. The application of these embeddings could prove particularly useful for low-resource sign languages in the future.
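To make the preprocessing steps named above more concrete, the following is a minimal sketch of missing keypoint imputation and keypoint normalization. The abstract does not specify how either step is carried out, so everything here is an assumption for illustration: missing keypoints are filled by linear interpolation over time, and each frame is centered on the shoulder midpoint and scaled by the shoulder distance. The function names, the `None` convention for undetected keypoints, and the shoulder indices are hypothetical and are not taken from OpenPose, MMPose or MediaPipe.

```python
from math import hypot
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Frame = List[Optional[Point]]  # None marks a keypoint the estimator failed to detect


def impute_missing(frames: List[Frame]) -> List[List[Point]]:
    """Fill missing keypoints by linear interpolation over time (assumed strategy)."""
    num_frames = len(frames)
    num_keypoints = len(frames[0]) if frames else 0
    filled: List[List[Point]] = [[(0.0, 0.0)] * num_keypoints for _ in range(num_frames)]
    for k in range(num_keypoints):
        observed = [t for t in range(num_frames) if frames[t][k] is not None]
        if not observed:
            continue  # never detected in this clip: leave the zero placeholder
        for t in range(num_frames):
            if frames[t][k] is not None:
                filled[t][k] = frames[t][k]
                continue
            prev = max((o for o in observed if o < t), default=None)
            nxt = min((o for o in observed if o > t), default=None)
            if prev is None:
                filled[t][k] = frames[nxt][k]   # leading gap: copy first detection
            elif nxt is None:
                filled[t][k] = frames[prev][k]  # trailing gap: copy last detection
            else:
                w = (t - prev) / (nxt - prev)
                (x0, y0), (x1, y1) = frames[prev][k], frames[nxt][k]
                filled[t][k] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return filled


def normalize_frame(frame: List[Point], left_shoulder: int = 0,
                    right_shoulder: int = 1, eps: float = 1e-6) -> List[Point]:
    """Center on the shoulder midpoint and scale by shoulder width (assumed scheme)."""
    (lx, ly), (rx, ry) = frame[left_shoulder], frame[right_shoulder]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    scale = max(hypot(lx - rx, ly - ry), eps)  # eps guards against division by zero
    return [((x - cx) / scale, (y - cy) / scale) for x, y in frame]


# Toy clip: 3 frames, 3 keypoints (left shoulder, right shoulder, one wrist).
frames: List[Frame] = [
    [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (2.0, 0.0), None],        # wrist missed in the middle frame
    [(0.0, 0.0), (2.0, 0.0), (3.0, 3.0)],
]
clean = [normalize_frame(f) for f in impute_missing(frames)]
print(clean[1])  # [(-0.5, 0.0), (0.5, 0.0), (0.5, 1.0)]
```

Normalizing relative to the signer's own body removes dependence on camera distance and position in the image, which is one plausible reason such a step would help signer-independent models; the resulting sequences would then be the input to whatever network learns the pose embedding.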
