Continuous sign language recognition (CSLR) aims to promote active and accessible communication for the hearing impaired by sequentially recognizing the signs in untrimmed sign language videos as textual glosses. The key challenge of CSLR is achieving cross-modality alignment between videos and gloss sequences. However, current cross-modality paradigms for CSLR do not use the gloss context to guide video clips toward global temporal context alignment, which degrades the visual-to-gloss mapping and harms recognition performance. To tackle this problem, we propose a novel Denoising-Diffusion global Alignment (DDA) scheme, which consists of a denoising-diffusion autoencoder and a DDA loss function. DDA leverages diffusion-based global alignment to align a video with its gloss sequence, facilitating global temporal context alignment. Specifically, DDA first introduces an auxiliary condition diffusion that constructs gloss-part noised bimodal representations of the video and gloss sequence. Because the recognition-oriented alignment knowledge captured in the diffusion denoising process cannot otherwise be fed back, DDA further introduces the Denoising-Diffusion Autoencoder, which adds a decoder to the auxiliary condition diffusion and denoises the partially noised bimodal representations in a self-supervised manner via the designed DDA loss. During denoising, each video clip representation can be reliably guided to re-establish the global temporal context among the clips by denoising the gloss sequence representation. Experiments on three public benchmarks demonstrate that DDA achieves state-of-the-art performance and confirm its feasibility for video representation enhancement.
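The idea of a gloss-part noised bimodal representation can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it (`noise_gloss_part`, `denoising_loss`, `alpha_bar`) is hypothetical. It assumes that only the gloss embeddings are noised while the video clip features stay clean, that the noise follows the standard DDPM forward process, and that a plain mean squared error can stand in for the DDA loss, whose exact form the abstract does not give.

```python
import math
import random


def noise_gloss_part(video, gloss, alpha_bar, rng):
    """Forward-diffuse only the gloss half of a bimodal sequence.

    video:     list of clip feature vectors, left clean (assumed to act as the condition)
    gloss:     list of gloss embedding vectors, which receive Gaussian noise
    alpha_bar: cumulative signal-retention factor in (0, 1] for the chosen step
    Returns (bimodal, eps): the concatenated sequence and the noise that was drawn.
    """
    eps = [[rng.gauss(0.0, 1.0) for _ in g] for g in gloss]
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    noisy_gloss = [
        [a * x + b * e for x, e in zip(g, n)] for g, n in zip(gloss, eps)
    ]
    # Concatenate along the sequence axis: clean video clips first, noised glosses after.
    return [list(v) for v in video] + noisy_gloss, eps


def denoising_loss(decoded_gloss, clean_gloss):
    """Mean squared error between decoder output and clean gloss embeddings.

    A stand-in objective only; the actual DDA loss is not specified in the abstract.
    """
    sq = [
        (d - c) ** 2
        for dv, cv in zip(decoded_gloss, clean_gloss)
        for d, c in zip(dv, cv)
    ]
    return sum(sq) / len(sq)


# Toy example: three video clips and two glosses, each a 2-dimensional vector.
rng = random.Random(0)
video = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
gloss = [[1.0, -1.0], [0.5, 0.5]]
bimodal, eps = noise_gloss_part(video, gloss, alpha_bar=0.5, rng=rng)
```

In a trained system, a decoder would take `bimodal` and predict the clean gloss embeddings, and the loss would be backpropagated into the video encoder. That backward path is what would let the gloss context shape the clip representations. The sketch stops before that step because the abstract does not describe the decoder.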