Progress in sign language processing has been hindered by a lack of data, which impedes work on recognition, translation, and production tasks. Comprehensive datasets do not exist for most of the world's sign languages, so a few sign languages are studied far more than others, and the research area is heavily skewed towards sign languages from high-income countries. In this work, we introduce JWSign, a new large and highly multilingual dataset for sign language translation. The dataset consists of 2,530 hours of Bible translations in 98 sign languages, featuring more than 1,500 individual signers. On this dataset, we report neural machine translation experiments. Apart from bilingual baseline systems, we also train multilingual systems, including some that take into account the typological relatedness of signed or spoken languages. Our experiments show that multilingual systems are superior to bilingual baselines and that, in higher-resource scenarios, clustering related language pairs improves translation quality.