Towards the extraction of robust sign embeddings for low resource sign language recognition

Isolated Sign Language Recognition (SLR) has mostly been applied to relatively large datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we face challenging visual conditions, coarticulated signing, small datasets, and the need for signer-independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and the challenging poses that occur in sign language, they lack robustness on sign language data, and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit the usefulness of keypoint-based models for SLR. It is also not clear from the existing literature which pose estimator, if any, performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when only the classifier layer of an SLR model is fine-tuned on a target sign language. Furthermore, fine-tuned transferred embeddings achieve better performance than models trained only on the target sign language. The application of these embeddings could prove particularly useful for low-resource sign languages in the future.
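To make the two preprocessing steps named above more concrete, the following is a minimal sketch of missing keypoint imputation and keypoint normalization. The abstract does not specify how either step is carried out, so everything here is an assumption chosen for illustration: missing keypoints are marked with NaN, gaps are filled by linear interpolation over time, and poses are centered on the shoulder midpoint and scaled by the shoulder distance. The function names, the array layout `(frames, keypoints, 2)`, and the shoulder indices are hypothetical.

```python
import numpy as np


def impute_missing(keypoints: np.ndarray) -> np.ndarray:
    """Fill NaN keypoints by linear interpolation over time.

    Assumption: `keypoints` has shape (T, K, 2) and NaN marks a keypoint
    that the pose estimator failed to detect in that frame.
    """
    filled = keypoints.astype(float)  # astype returns a copy by default
    frames = np.arange(filled.shape[0])
    for k in range(filled.shape[1]):
        for d in range(filled.shape[2]):
            series = filled[:, k, d]  # view into `filled`
            known = ~np.isnan(series)
            if known.any():
                # np.interp holds the nearest known value at the clip edges.
                series[~known] = np.interp(
                    frames[~known], frames[known], series[known]
                )
            else:
                # Keypoint never detected in this clip: fall back to zero.
                series[:] = 0.0
    return filled


def normalize(
    keypoints: np.ndarray,
    left_shoulder: int = 0,   # hypothetical index, depends on the estimator
    right_shoulder: int = 1,  # hypothetical index, depends on the estimator
    eps: float = 1e-6,
) -> np.ndarray:
    """Center each frame on the shoulder midpoint and scale by shoulder width."""
    ls = keypoints[:, left_shoulder]           # shape (T, 2)
    rs = keypoints[:, right_shoulder]          # shape (T, 2)
    center = (ls + rs) / 2.0
    scale = np.linalg.norm(ls - rs, axis=-1)   # shape (T,)
    scale = np.maximum(scale, eps)             # avoid division by zero
    return (keypoints - center[:, None, :]) / scale[:, None, None]
```

Under these assumptions, a clip would be passed through `impute_missing` first and `normalize` second, so that the normalization never sees NaN values. The resulting arrays could then serve as input to a learned pose embedding, which is not sketched here.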
