Sign language, essential for the deaf and hard-of-hearing, presents unique challenges in translation and production due to its multimodal nature and the inherent ambiguity in mapping sign language motion to spoken language words. Previous methods often rely on gloss annotations, which require time-intensive labor and specialized expertise in sign language. Gloss-free methods have emerged to address these limitations, but they often depend on external sign language data or dictionaries and so fail to completely eliminate the need for gloss annotations. There is a clear demand for a comprehensive approach that can supplant gloss annotations and be used for both Sign Language Translation (SLT) and Sign Language Production (SLP). We introduce Universal Gloss-level Representation (UniGloR), a unified, self-supervised solution for both SLT and SLP, trained on multiple datasets including PHOENIX14T, How2Sign, and NIASL2021. Our results demonstrate UniGloR's effectiveness in the translation and production tasks. We further report an encouraging result for Sign Language Recognition (SLR) on previously unseen data. Our study suggests that self-supervised learning can be carried out in a unified manner, paving the way for innovative and practical applications in future research.