Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. It is essential for applications such as virtual assistants, transcription services, and communication tools. Audio-Visual Speech Recognition (AVSR) extends traditional speech recognition by incorporating visual modalities such as lip movements and facial expressions, which makes it particularly robust in noisy environments. While traditional AVSR models trained on large-scale datasets with large parameter counts can achieve remarkable accuracy, often surpassing human performance, they also incur high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the parameter count by integrating a Dual Conformer Interaction Module (DCIM). In addition, we propose a pre-training method that further optimizes model performance by selectively updating parameters, yielding significant gains in efficiency. Unlike conventional models, which must learn the hierarchical relationship between the audio and visual modalities on their own, our approach builds this distinction directly into the model architecture. This design improves both efficiency and performance, resulting in a more practical and effective solution for AVSR tasks.
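The abstract does not specify DCIM's internals, so the following PyTorch sketch is only one plausible reading of a dual-branch interaction block: two modality streams exchange information through cross-attention before each continues along its own feed-forward path. The class name `DualConformerInteraction`, the cross-attention layout, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualConformerInteraction(nn.Module):
    """Hypothetical DCIM-style block: audio and visual branches query
    each other via cross-attention, then refine per branch. Illustrative
    only; the actual DCIM design is not given in the abstract."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Each stream attends to the other modality's features.
        self.audio_from_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-branch feed-forward refinement after the cross-modal exchange.
        self.audio_ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.visual_ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Cross-modal exchange with residual connections; sequence
        # lengths may differ between the two modalities.
        a_ctx, _ = self.audio_from_visual(audio, visual, visual)
        v_ctx, _ = self.visual_from_audio(visual, audio, audio)
        audio = audio + a_ctx
        visual = visual + v_ctx
        # Branch-local refinement.
        audio = audio + self.audio_ffn(audio)
        visual = visual + self.visual_ffn(visual)
        return audio, visual

if __name__ == "__main__":
    block = DualConformerInteraction()
    a = torch.randn(2, 100, 256)  # (batch, audio frames, dim)
    v = torch.randn(2, 25, 256)   # (batch, video frames, dim)
    a_out, v_out = block(a, v)
    print(a_out.shape, v_out.shape)  # (2, 100, 256) (2, 25, 256)
```

Because the interaction layers carry the cross-modal reasoning explicitly, each single-modality branch can stay small, which is consistent with the parameter-reduction goal stated above.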
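The selective-update pre-training is likewise described only at a high level. A minimal sketch of the general idea, assuming a keyword-based choice of which submodules to train (the helper name `select_trainable` and the keyword scheme are hypothetical, not the paper's exact recipe):

```python
import torch

def select_trainable(model: torch.nn.Module,
                     trainable_keywords=("interaction", "ffn")):
    """Freeze every parameter except those whose names contain one of
    the given keywords, so pre-training updates only a small subset.
    Illustrative assumption, not the paper's published procedure."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)

# Usage sketch: hand the optimizer only the parameters left trainable.
# model = DualConformerInteraction()
# select_trainable(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```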