Automatic Cued Speech Recognition (ACSR) provides an intelligent human-machine interface for visual communication, where the Cued Speech (CS) system uses lip movements and hand gestures to code spoken language for hearing-impaired people. Previous ACSR approaches often use direct feature concatenation as the main fusion paradigm. However, the asynchronous modalities in CS (i.e., lip, hand shape, and hand position) may interfere with one another when their features are concatenated. To address this challenge, we propose a transformer-based cross-modal mutual learning framework to promote multi-modal interaction. Unlike vanilla self-attention, our model forces the modality-specific information of each modality to pass through a modality-invariant codebook, collating linguistic representations for the tokens of each modality. The shared linguistic knowledge is then used to re-synchronize the multi-modal sequences. Moreover, we establish a novel large-scale multi-speaker CS dataset for Mandarin Chinese. To our knowledge, this is the first work on ACSR for Mandarin Chinese. Extensive experiments are conducted on different languages (i.e., Chinese, French, and British English). Results demonstrate that our model outperforms the state of the art in recognition performance by a large margin.
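As a rough illustration of the idea of routing tokens through a shared codebook rather than concatenating features, the following minimal sketch may help. It is not the architecture proposed in this work: the function names (`attend`, `codebook_exchange`), the single-head attention without learned projections, the residual fusion step, and all sizes are assumptions chosen for illustration. The three streams are deliberately given different lengths to mimic asynchronous modalities.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def codebook_exchange(streams, codebook):
    """Hypothetical fusion step: modalities interact only via a shared codebook.

    streams:  dict mapping a modality name to a list of token vectors
    codebook: list of code vectors shared by all modalities
    """
    # Step 1: re-express every token as a convex combination of codebook entries,
    # so all modalities end up in the same (modality-invariant) space.
    shared = {name: attend(tokens, codebook, codebook)
              for name, tokens in streams.items()}

    # Step 2: each modality attends to the shared-space tokens of the others
    # and adds the result as a residual (an assumed, simplified fusion rule).
    fused = {}
    for name, tokens in shared.items():
        others = [t for other, seq in shared.items() if other != name for t in seq]
        context = attend(tokens, others, others)
        fused[name] = [[a + b for a, b in zip(t, c)] for t, c in zip(tokens, context)]
    return shared, fused


random.seed(0)
DIM, CODES = 8, 5  # illustrative sizes, not taken from the paper


def random_tokens(n):
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


codebook = random_tokens(CODES)
streams = {
    "lip": random_tokens(12),
    "hand_shape": random_tokens(9),
    "hand_position": random_tokens(10),
}
shared, fused = codebook_exchange(streams, codebook)
for name in streams:
    print(name, len(fused[name]), len(fused[name][0]))
```

In this sketch each stream keeps its own length after fusion, and cross-modal information reaches a token only after both sides have been projected onto the same set of codes. A real model would use learned projections, multiple heads, and a trained codebook.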