Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal by design, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We train with weak supervision and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement that accounts for missing keypoints and measures the distance between pose sequences using DTW-MJE (dynamic time warping over mean joint error). We validate its correctness using AUTSL, a large-scale Sign language dataset, show that it measures the distance between pose sequences more accurately than existing measurements, and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.
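To make the idea of a DTW-MJE distance that tolerates missing keypoints concrete, the following is a minimal sketch and not the released implementation. It assumes that poses are stored as NumPy arrays of shape (frames, keypoints, dims), that a missing keypoint is marked with NaN, and that a keypoint missing in either frame is simply skipped when averaging. The function names `masked_mje` and `dtw_mje` are hypothetical, and the actual measurement may treat missing keypoints differently.

```python
import numpy as np


def masked_mje(frame_a, frame_b):
    """Mean joint error between two frames of shape (K, D).

    Assumption: a keypoint that is NaN in either frame is treated as
    missing and excluded from the mean.
    """
    missing = np.isnan(frame_a).any(axis=1) | np.isnan(frame_b).any(axis=1)
    valid = ~missing
    if not valid.any():
        # Assumption: frames with no shared keypoints add no cost.
        return 0.0
    per_joint = np.linalg.norm(frame_a[valid] - frame_b[valid], axis=1)
    return float(per_joint.mean())


def dtw_mje(seq_a, seq_b):
    """Accumulated dynamic-time-warping cost with masked MJE as frame cost.

    seq_a has shape (T_a, K, D) and seq_b has shape (T_b, K, D).
    """
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = masked_mje(seq_a[i - 1], seq_b[j - 1])
            # Standard DTW recurrence: extend the cheapest of the three
            # neighbouring alignments (insertion, deletion, match).
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(3, 4, 2))
    # Repeating a frame simulates the same sign performed more slowly.
    slower = np.concatenate([reference[:2], reference[1:]])
    print(dtw_mje(reference, slower))
```

Because the value returned here is an accumulated cost, it grows with sequence length; comparing sequences of very different lengths would likely require some form of normalization, which this sketch leaves out.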