Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. Pre-training methods for sign language have typically focused on either supervised pre-training, which cannot take advantage of unlabeled data, or context-independent (frame or video segment) representations, which ignore the effects of relationships across time in sign language. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
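To make the multi-stream masked prediction objective concrete, below is a minimal PyTorch sketch of the idea described above: frame-level features from several visual streams are masked, encoded jointly, and the model is trained to predict each stream's pre-computed cluster label at masked positions. The stream count, feature and model dimensions, cluster vocabulary size, masking rate, and architecture sizes here are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical settings; the paper's actual values are not given in the abstract.
NUM_STREAMS = 4     # e.g. left hand, right hand, face, body pose (assumed)
FEAT_DIM = 256      # per-stream feature dimension (assumed)
NUM_CLUSTERS = 500  # cluster vocabulary per stream (assumed)
MODEL_DIM = 512

class MultiStreamMaskedPredictor(nn.Module):
    """HuBERT-style masked cluster prediction over multiple visual streams (sketch)."""
    def __init__(self):
        super().__init__()
        self.input_proj = nn.Linear(NUM_STREAMS * FEAT_DIM, MODEL_DIM)
        self.mask_embed = nn.Parameter(torch.zeros(FEAT_DIM))
        layer = nn.TransformerEncoderLayer(d_model=MODEL_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # One classification head per stream, predicting that stream's cluster id.
        self.heads = nn.ModuleList(
            [nn.Linear(MODEL_DIM, NUM_CLUSTERS) for _ in range(NUM_STREAMS)])

    def forward(self, streams, mask):
        # streams: (batch, time, NUM_STREAMS, FEAT_DIM); mask: (batch, time) bool
        x = streams.clone()
        x[mask] = self.mask_embed            # replace masked frames with a learned embedding
        x = x.flatten(2)                     # concatenate streams for each frame
        h = self.encoder(self.input_proj(x))
        return [head(h) for head in self.heads]  # per-stream cluster logits

def masked_prediction_loss(logits, targets, mask):
    # targets: (batch, time, NUM_STREAMS) cluster ids; loss computed on masked frames only
    loss = 0.0
    for s, stream_logits in enumerate(logits):
        loss = loss + nn.functional.cross_entropy(
            stream_logits[mask], targets[mask][:, s])
    return loss / len(logits)

# Toy usage with random data in place of real pose/hand/face features.
B, T = 2, 64
streams = torch.randn(B, T, NUM_STREAMS, FEAT_DIM)
targets = torch.randint(0, NUM_CLUSTERS, (B, T, NUM_STREAMS))
mask = torch.rand(B, T) < 0.5                # mask roughly half the frames (assumed rate)
model = MultiStreamMaskedPredictor()
loss = masked_prediction_loss(model(streams, mask), targets, mask)
loss.backward()
```

This sketch masks whole frames across all streams at once for simplicity; a per-stream masking scheme, or the specific feature extractors and clustering used by SHuBERT, would change the details without altering the basic masked-prediction structure.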