In this work, we leverage the success of BERT pre-training and model domain-specific statistics to strengthen the sign language recognition~(SLR) model. Considering the dominance of the hands and body in sign language expression, we organize them as pose triplet units and feed them into the Transformer backbone in a frame-wise manner. Pre-training is performed by reconstructing the masked triplet units from the corrupted input sequence, which learns hierarchical correlation context cues within and across triplet units. Notably, unlike the highly semantic word tokens in BERT, the pose unit is a low-level signal that originally lies in continuous space, which prevents direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit, which adaptively extracts discrete pseudo labels from the pose triplet unit to represent the semantic gesture/body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with the newly added task-specific layer. Extensive experiments validate the effectiveness of our proposed method, which achieves new state-of-the-art performance on all four benchmarks with a notable gain.
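To make the pre-training objective concrete, the following is a minimal sketch of masked reconstruction against discrete pseudo labels. It is an illustration under stated assumptions, not the method's actual implementation: the tokenizer here is a plain nearest-neighbor codebook lookup standing in for the coupling tokenization, each frame is reduced to a single 2-D vector instead of a full pose triplet, and all names (`tokenize`, `mask_frames`, `masked_loss`, the toy `codebook`) are hypothetical.

```python
import math
import random

def tokenize(unit, codebook):
    """Map a continuous pose vector to a discrete pseudo label.

    Assumption: nearest codebook entry under squared L2 distance,
    used here only as a stand-in for a learned tokenizer.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(unit, codebook[k]))

def mask_frames(frames, mask_ratio, rng):
    """Zero out a random subset of frames; return the corrupted copy and masked indices."""
    n_mask = max(1, round(mask_ratio * len(frames)))
    masked = sorted(rng.sample(range(len(frames)), n_mask))
    corrupted = [list(f) for f in frames]
    for t in masked:
        corrupted[t] = [0.0] * len(frames[t])
    return corrupted, masked

def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def masked_loss(logits_per_frame, labels, masked):
    """Average cross-entropy over the masked frames only, as in BERT-style pre-training."""
    return sum(cross_entropy(logits_per_frame[t], labels[t]) for t in masked) / len(masked)

# Toy data (hypothetical): four frames, a codebook with three entries.
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8], [0.8, 0.9]]

labels = [tokenize(f, codebook) for f in frames]      # discrete targets
corrupted, masked = mask_frames(frames, 0.5, rng)     # encoder input

# Placeholder for the encoder output: uniform logits over the codebook.
uniform_logits = [[0.0] * len(codebook) for _ in frames]
loss = masked_loss(uniform_logits, labels, masked)
```

In a real setup, `uniform_logits` would be replaced by the Transformer encoder's predictions on `corrupted`, and the loss would be minimized by gradient descent. The point of the sketch is only that discretizing the continuous pose signal is what makes a cross-entropy target available at the masked positions.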