Spoken language glossification (SLG) aims to translate spoken language text into sign language gloss, i.e., a written record of sign language. In this work, we present a framework named $S$emi-$S$upervised $S$poken $L$anguage $G$lossification ($S^3$LG) for SLG. To tackle the bottleneck of limited parallel data in SLG, our $S^3$LG incorporates large-scale monolingual spoken language text into SLG training. The proposed framework follows the self-training structure that iteratively annotates and learns from pseudo labels. Considering the lexical similarity and syntactic difference between sign language and spoken language, our $S^3$LG adopts both a rule-based heuristic and a model-based approach for auto-annotation. During training, we randomly mix these complementary synthetic datasets and mark their differences with a special token. As the synthetic data may be of lower quality, $S^3$LG further leverages consistency regularization to reduce the negative impact of noise in the synthetic data. Extensive experiments on public benchmarks demonstrate the effectiveness of $S^3$LG. Our code is available at \url{https://github.com/yaohj11/S3LG}.
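The data-mixing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tag strings, function names, and example sentences are assumptions, and the actual annotators in $S^3$LG are a rule-based heuristic and a trained gloss model rather than the toy lists used here.

```python
import random

# Assumed special tokens marking which annotator produced each pseudo gloss;
# the paper's actual token names may differ.
RULE_TAG = "<rule>"
MODEL_TAG = "<model>"

def mix_synthetic(rule_pairs, model_pairs, seed=0):
    """Randomly interleave two synthetic (text, gloss) corpora,
    prefixing each gloss with a token that records its provenance."""
    tagged = [(text, f"{RULE_TAG} {gloss}") for text, gloss in rule_pairs]
    tagged += [(text, f"{MODEL_TAG} {gloss}") for text, gloss in model_pairs]
    random.Random(seed).shuffle(tagged)  # random mixing of the two sources
    return tagged

# Toy pseudo-labeled pairs for illustration only.
mixed = mix_synthetic(
    [("tomorrow it will rain", "TOMORROW RAIN")],
    [("tomorrow it will rain", "RAIN TOMORROW")],
)
```

Tagging provenance lets the model condition on the annotation source, so it can learn to discount systematic noise specific to either annotator.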