Continuous sign language recognition (CSLR) serves the social good by promoting active and accessible communication for hearing-impaired people. Current CSLR research adopts a cross-modality alignment scheme to learn the mapping between video clips and textual glosses. However, this local alignment method, especially under weak data annotation, ignores the contextual information of the modalities and directly reduces the generalization of visual features. To this end, we propose a novel Denoising-Diffusion global Alignment scheme (DDA), which models the mapping between the entire video and the gloss sequence. DDA consists of a partial noising process strategy and a denoising-diffusion autoencoder. The former lets the text modality guide the visual modality efficiently; the latter learns the global alignment information of the two modalities in a denoising manner. DDA confirms the feasibility of diffusion models for visual representation learning in CSLR. Experiments on three public benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, the proposed method can serve as a plug-and-play optimization for other CSLR methods.
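To make the idea of a partial noising process concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which modality is noised, which noise schedule is used, or how features are shaped, so everything here is an assumption: the visual features are forward-diffused, the gloss embeddings are kept clean so they can act as guidance, the schedule is a standard cosine schedule, and the names `cosine_alpha_bar` and `partial_noise` are hypothetical.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal rate alpha_bar(t) of a cosine noise schedule (assumed)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def partial_noise(gloss_emb, visual_feat, t, num_steps, rng):
    """Forward-diffuse only the visual part of a joint [gloss ; visual] sequence.

    gloss_emb and visual_feat are lists of equal-width float vectors.
    Assumption: the gloss (text) part stays clean and serves as guidance.
    Returns the joint sequence and the Gaussian noise that was applied.
    """
    a_bar = cosine_alpha_bar(t, num_steps)
    signal_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)

    # One standard-normal sample per visual feature value.
    noise = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in visual_feat]

    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, applied to visual frames only.
    noised_visual = [
        [signal_scale * x + noise_scale * e for x, e in zip(frame, eps)]
        for frame, eps in zip(visual_feat, noise)
    ]

    # Gloss embeddings are copied unchanged.
    joint = [list(g) for g in gloss_emb] + noised_visual
    return joint, noise


if __name__ == "__main__":
    rng = random.Random(0)
    gloss = [[1.0, 2.0], [3.0, 4.0]]                 # toy gloss embeddings
    visual = [[0.5, -0.5], [1.5, 2.5], [0.0, 1.0]]   # toy frame features
    joint, _ = partial_noise(gloss, visual, t=5, num_steps=10, rng=rng)
    print(joint[:2] == gloss)   # the text part is untouched
```

Under these assumptions, a denoising autoencoder would receive the joint sequence and be trained to recover the clean visual features, so that the untouched gloss positions provide global context for the whole video rather than for individual clips.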