Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose Uni-Sign, a unified pre-training framework that eliminates this gap through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, Uni-Sign unifies SLU tasks by treating each downstream task as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, mitigating keypoint inaccuracies and improving computational efficiency. Extensive experiments on multiple SLU benchmarks demonstrate that Uni-Sign achieves state-of-the-art performance across downstream SLU tasks. Dataset and code are available at github.com/ZechengLi19/Uni-Sign.