Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition

Cued Speech (CS) is a purely visual coding method used by hearing-impaired people that combines lip reading with several specific hand shapes to make spoken language visible. Automatic CS recognition (ACSR) seeks to transcribe the visual cues of speech into text, which can help hearing-impaired people communicate effectively. The visual information in CS comprises lip reading and hand cueing, so fusing the two plays an important role in ACSR. However, most previous fusion methods struggle to capture the global dependencies present in the long input sequences of multi-modal CS data. As a result, these methods generally fail to learn the effective cross-modal relationships that contribute to fusion. Recently, attention-based transformers have become a prevalent approach for capturing global dependencies over long sequences in multi-modal fusion, but existing multi-modal fusion transformers suffer from both poor recognition accuracy and inefficient computation on the ACSR task. To address these problems, we develop a computation- and parameter-efficient multi-modal fusion transformer built on a novel Token-Importance-Aware Attention mechanism (TIAA), in which a token utilization rate (TUR) is formulated to select the important tokens from the multi-modal streams. More precisely, TIAA first models the modality-specific fine-grained temporal dependencies over all tokens of each modality, and then learns efficient cross-modal interaction for the modality-shared coarse-grained temporal dependencies over the important tokens of the different modalities. In addition, a lightweight gated hidden projection is designed to control the feature flows of TIAA. The resulting model, named Economical Cued Speech Fusion Transformer (EcoCued), achieves state-of-the-art performance on all existing CS datasets compared with existing transformer-based fusion methods and ACSR fusion methods.
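
The two-stage idea behind TIAA can be illustrated with a minimal NumPy sketch. It is not the EcoCued implementation. The abstract does not specify how token importance is scored or how the TUR is applied, so the sketch assumes that importance is the mean self-attention a token receives and that the TUR is the fraction of tokens kept per modality. All function names (`attention`, `select_important`, `tiaa_sketch`) are hypothetical, and the learned projections and the gated hidden projection are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention without learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def select_important(tokens, weights, tur):
    # ASSUMPTION: importance = mean attention received by each token.
    # ASSUMPTION: tur = fraction of tokens kept (token utilization rate).
    importance = weights.mean(axis=0)
    k = max(1, int(round(tur * tokens.shape[0])))
    idx = np.sort(np.argsort(importance)[-k:])  # keep temporal order
    return tokens[idx], idx

def tiaa_sketch(lip, hand, tur=0.25):
    # Stage 1: modality-specific attention over ALL tokens of each stream.
    lip_ctx, lip_w = attention(lip, lip, lip)
    hand_ctx, hand_w = attention(hand, hand, hand)

    # Keep only the important tokens of each modality.
    lip_sel, lip_idx = select_important(lip_ctx, lip_w, tur)
    hand_sel, hand_idx = select_important(hand_ctx, hand_w, tur)

    # Stage 2: cross-modal attention restricted to the important tokens,
    # so each stream attends to 2k shared tokens instead of every token.
    shared = np.concatenate([lip_sel, hand_sel], axis=0)
    lip_cross, _ = attention(lip_ctx, shared, shared)
    hand_cross, _ = attention(hand_ctx, shared, shared)

    # Residual combination of the fine-grained and coarse-grained paths.
    return lip_ctx + lip_cross, hand_ctx + hand_cross, lip_idx, hand_idx

rng = np.random.default_rng(0)
lip = rng.normal(size=(40, 16))   # 40 lip tokens, 16-dim features
hand = rng.normal(size=(40, 16))  # 40 hand tokens, 16-dim features
lip_out, hand_out, lip_idx, hand_idx = tiaa_sketch(lip, hand, tur=0.25)
print(lip_out.shape, hand_out.shape, len(lip_idx), len(hand_idx))
```

Under these assumptions, with T tokens per modality and k = TUR × T tokens kept, the cross-modal step scores T × 2k query-key pairs per stream rather than the T × 2T needed for full attention over both streams, which is the kind of saving the abstract attributes to attending only over important tokens.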
