Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Audio-visual speech recognition (AVSR), which extends automatic speech recognition (ASR) by using the video modality as additional information, has recently shown promising results in complex acoustic environments. However, there is still substantial room for improvement, because visual modules are computationally expensive and the fusion of the audio and visual modalities is often ineffective. To address these drawbacks, we propose a down-up sampling-based AVSR model (Hourglass-AVSR) that achieves both high efficiency and high performance. The temporal length of its sequence is scaled during intermediate processing, giving the model the shape of an hourglass. First, we propose a context- and residual-aware video upsampling approach to improve recognition performance, which uses contextual information from the visual representations and captures residual information between adjacent video frames. Second, we introduce a visual-audio alignment approach during upsampling by explicitly incorporating a boundary constraint loss. In addition, we propose a cross-layer attention fusion to capture the modality dependencies within each visual encoder layer. Experiments conducted on the MISP-AVSR dataset show that the proposed Hourglass-AVSR model outperforms the ASR model, with relative reductions in concatenated minimum permutation character error rate (cpCER) of 12.9% and 20.8% on the far-field and middle-field test sets, respectively. Moreover, compared with other state-of-the-art AVSR models, our model obtains the largest cpCER improvement from its visual module. Furthermore, thanks to the down-up sampling approach, the Hourglass-AVSR model reduces overall computation cost by 54.2% with only minor performance degradation.
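
The hourglass shape described above can be illustrated with a minimal sketch of how the temporal length changes when a frame sequence is downsampled and later restored. This is not the proposed context- and residual-aware upsampling: mean pooling and frame repetition are stand-ins chosen for simplicity, and the function names, the factor of 4, and the toy two-dimensional features are all assumptions made for illustration.

```python
def downsample(frames, factor):
    """Average every `factor` consecutive frames (a ragged tail is dropped).

    Stand-in for the downsampling stage; the actual model may differ.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [sum(column) / factor for column in zip(*frames[i:i + factor])]
        for i in range(0, usable, factor)
    ]


def upsample(frames, factor):
    """Repeat each frame `factor` times to restore the original length.

    Plain repetition only; it ignores context and inter-frame residuals.
    """
    return [list(frame) for frame in frames for _ in range(factor)]


# Hypothetical input: 100 video frames with 2-dimensional features.
video = [[float(t), float(2 * t)] for t in range(100)]

narrow = downsample(video, 4)    # middle of the hourglass: 25 frames
restored = upsample(narrow, 4)   # back to 100 frames for fusion with audio

print(len(video), len(narrow), len(restored))
```

Under these assumptions, the layers that operate on `narrow` see a sequence one quarter as long as the input, which is where a saving in computation would come from in a design of this kind.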
