Unlike traditional video retrieval, sign language retrieval focuses more on understanding the semantics of the human actions contained in video clips. Previous works typically encode only RGB videos to obtain high-level semantic features, so local action details are drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of embedding dense visual data during end-to-end training, and therefore adopt an offline RGB encoder, leading to suboptimal feature representations. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. To achieve better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module that aggregates adjacent clip features with similar semantic information from both intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature through contextual matching of fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework contains only lightweight learnable networks and can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.
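To make the fusion idea concrete, the following is a minimal, hypothetical PyTorch sketch of intra- and inter-modality attention over clip-level pose and RGB features, in the spirit of the CGAF module described above. The layer sizes, the single fusion direction (pose attending to RGB), and the residual design are illustrative assumptions, not the actual SEDS implementation.

```python
# Hypothetical sketch (not the authors' code): attention-based fusion of
# pose and RGB clip features. Shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # intra-modality aggregation over adjacent pose clips (self-attention)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # inter-modality aggregation (pose queries attend to RGB keys/values)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, pose_feats: torch.Tensor, rgb_feats: torch.Tensor) -> torch.Tensor:
        # pose_feats, rgb_feats: (batch, num_clips, dim) clip-level features
        pose_ctx, _ = self.self_attn(pose_feats, pose_feats, pose_feats)  # intra-modality
        fused, _ = self.cross_attn(pose_ctx, rgb_feats, rgb_feats)        # inter-modality
        return self.norm(fused + pose_ctx)                                # residual + norm


if __name__ == "__main__":
    pose = torch.randn(2, 16, 512)  # 2 videos, 16 clips, 512-dim pose features
    rgb = torch.randn(2, 16, 512)   # matching RGB clip features
    print(CrossModalFusion()(pose, rgb).shape)  # torch.Size([2, 16, 512])
```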