In this work, we leverage the success of BERT-style pre-training and model domain-specific statistics to improve sign language recognition (SLR). Since hand and body movements dominate sign language expression, we organize them into pose triplet units and feed them into a Transformer backbone frame by frame. Pre-training reconstructs the masked triplet units from a corrupted input sequence, which teaches the model hierarchical contextual correlations both within and across triplet units. Unlike the highly semantic word tokens in BERT, a pose unit is a low-level signal that lies in a continuous space, so the BERT cross-entropy objective cannot be applied directly. We bridge this semantic gap with a coupling tokenization of the triplet unit, which adaptively extracts a discrete pseudo label from each pose triplet unit to represent the semantic gesture or body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with a newly added task-specific layer. Extensive experiments validate the effectiveness of the proposed method, which achieves new state-of-the-art performance on all four benchmarks with a notable gain.
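As a rough illustration of the pre-training objective described above, the following sketch computes a masked-prediction cross-entropy loss over discrete pseudo labels. It is a minimal sketch under stated assumptions, not the method's actual implementation: the nearest-codebook lookup in `nearest_code` is a simple stand-in for the coupling tokenizer, masking by zeroing coordinates is an assumed corruption scheme, and `predict_logits` is a hypothetical placeholder for the Transformer encoder with a prediction head. All function and parameter names are illustrative.

```python
import math
import random

PARTS = ("left_hand", "right_hand", "body")  # one pose triplet unit per frame


def nearest_code(vec, codebook):
    """Index of the codebook entry closest to a continuous pose vector.

    Assumption: a nearest-neighbour lookup stands in for the learned tokenizer.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def tokenize_frame(frame, codebooks):
    """Map each part of a triplet unit to a discrete pseudo label."""
    return {p: nearest_code(frame[p], codebooks[p]) for p in PARTS}


def corrupt(sequence, mask_ratio, rng):
    """Mask random (frame, part) slots by zeroing them (assumed scheme)."""
    slots = [(t, p) for t in range(len(sequence)) for p in PARTS]
    n_mask = max(1, round(mask_ratio * len(slots)))
    masked = rng.sample(slots, n_mask)
    corrupted = [{p: list(f[p]) for p in PARTS} for f in sequence]
    for t, p in masked:
        corrupted[t][p] = [0.0] * len(sequence[t][p])
    return corrupted, masked


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def pretrain_loss(sequence, codebooks, predict_logits, mask_ratio=0.3, seed=0):
    """Average cross-entropy over the masked slots only.

    Pseudo labels come from the clean sequence; the (hypothetical) model
    `predict_logits(corrupted, t, part)` only sees the corrupted one.
    """
    rng = random.Random(seed)
    labels = [tokenize_frame(f, codebooks) for f in sequence]
    corrupted, masked = corrupt(sequence, mask_ratio, rng)
    losses = [cross_entropy(predict_logits(corrupted, t, p), labels[t][p])
              for t, p in masked]
    return sum(losses) / len(losses)
```

With a placeholder predictor that returns all-zero logits over K codes, the predicted distribution is uniform and the loss equals ln K, which gives a quick sanity check before plugging in a real encoder.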