This work aims to extract spatial and dynamic features effectively for Continuous Sign Language Recognition (CSLR). To this end, we use a two-pathway SlowFast network in which the pathways operate at different temporal resolutions, so that one captures spatial information (hand shapes, facial expressions) and the other captures dynamic information (movements). In addition, we introduce two feature fusion methods designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which transfers dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks without adding inference time. As a result, our model strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state of the art on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
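The two-pathway idea and the bi-directional exchange can be illustrated with a minimal, dependency-free sketch. It is not the implementation described above: the frame-rate ratio `alpha`, the average-pooling and repeat operators used to align the two frame rates, the additive fusion, and all function names are assumptions chosen for illustration, and both pathways are assumed to share the same feature dimension.

```python
from typing import List, Tuple

Feature = List[float]  # one feature vector per sampled frame


def sample_pathways(num_frames: int, alpha: int = 4) -> Tuple[List[int], List[int]]:
    """Return (slow, fast) frame indices.

    The fast pathway keeps every frame; the slow pathway keeps every
    alpha-th frame. alpha is an assumed ratio, not a value from the text.
    """
    fast = list(range(num_frames))
    slow = fast[::alpha]
    return slow, fast


def fast_to_slow(fast_feats: List[Feature], alpha: int) -> List[Feature]:
    """Average each group of alpha fast features down to the slow frame rate."""
    pooled = []
    for start in range(0, len(fast_feats), alpha):
        group = fast_feats[start:start + alpha]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def slow_to_fast(slow_feats: List[Feature], alpha: int, fast_len: int) -> List[Feature]:
    """Repeat each slow feature alpha times, then trim to the fast length."""
    repeated = [f for f in slow_feats for _ in range(alpha)]
    return repeated[:fast_len]


def bidirectional_fuse(
    slow_feats: List[Feature], fast_feats: List[Feature], alpha: int
) -> Tuple[List[Feature], List[Feature]]:
    """Add temporally aligned features from each pathway into the other."""
    to_slow = fast_to_slow(fast_feats, alpha)
    to_fast = slow_to_fast(slow_feats, alpha, len(fast_feats))
    fused_slow = [[a + b for a, b in zip(s, f)] for s, f in zip(slow_feats, to_slow)]
    fused_fast = [[a + b for a, b in zip(f, s)] for f, s in zip(fast_feats, to_fast)]
    return fused_slow, fused_fast
```

With 8 input frames and `alpha=4`, `sample_pathways` gives the slow pathway frames 0 and 4 and the fast pathway all 8 frames. `bidirectional_fuse` then returns two feature sequences of length 2 and 8, each carrying information from the other pathway.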