Speech recognition is the technology that enables machines to interpret and process human speech, converting spoken language into text or commands. It is essential for applications such as virtual assistants, transcription services, and communication tools. Audio-Visual Speech Recognition (AVSR) extends traditional speech recognition by incorporating visual modalities such as lip movements and facial expressions, which makes it particularly robust in noisy environments. While traditional AVSR models trained on large-scale datasets with large parameter counts can achieve remarkable accuracy, often surpassing human performance, they also incur high training costs and deployment challenges. To address these issues, we introduce an efficient AVSR model that reduces the parameter count by integrating a Dual Conformer Interaction Module (DCIM). In addition, we propose a pre-training method that further optimizes model performance by selectively updating parameters, yielding significant gains in efficiency. Unlike conventional models, which must learn the hierarchical relationship between the audio and visual modalities on their own, our approach builds this distinction directly into the model architecture. This design improves both efficiency and performance, resulting in a more practical and effective solution for AVSR tasks.
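The abstract does not specify DCIM's internals, so the following PyTorch sketch is only one plausible reading of a dual-branch interaction block: two modality streams exchange information through cross-attention before each continues along its own feed-forward path. The class name `DualConformerInteraction`, the cross-attention layout, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DualConformerInteraction(nn.Module):
    """Hypothetical DCIM-style block: audio and visual branches query
    each other via cross-attention, then refine per branch. Illustrative
    only; the actual DCIM design is not given in the abstract."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Each stream attends to the other modality's features.
        self.audio_from_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Per-branch feed-forward refinement after the cross-modal exchange.
        self.audio_ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.visual_ffn = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # Cross-modal exchange with residual connections; sequence
        # lengths may differ between the two modalities.
        a_ctx, _ = self.audio_from_visual(audio, visual, visual)
        v_ctx, _ = self.visual_from_audio(visual, audio, audio)
        audio = audio + a_ctx
        visual = visual + v_ctx
        # Branch-local refinement.
        audio = audio + self.audio_ffn(audio)
        visual = visual + self.visual_ffn(visual)
        return audio, visual

if __name__ == "__main__":
    block = DualConformerInteraction()
    a = torch.randn(2, 100, 256)  # (batch, audio frames, dim)
    v = torch.randn(2, 25, 256)   # (batch, video frames, dim)
    a_out, v_out = block(a, v)
    print(a_out.shape, v_out.shape)  # (2, 100, 256) (2, 25, 256)
```

Because the interaction layers carry the cross-modal reasoning explicitly, each single-modality branch can stay small, which is consistent with the parameter-reduction goal stated above.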
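The selective-update pre-training is likewise described only at a high level. A minimal sketch of the general idea, assuming a keyword-based choice of which submodules to train (the helper name `select_trainable` and the keyword scheme are hypothetical, not the paper's exact recipe):

```python
import torch

def select_trainable(model: torch.nn.Module,
                     trainable_keywords=("interaction", "ffn")):
    """Freeze every parameter except those whose names contain one of
    the given keywords, so pre-training updates only a small subset.
    Illustrative assumption, not the paper's published procedure."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)

# Usage sketch: hand the optimizer only the parameters left trainable.
# model = DualConformerInteraction()
# select_trainable(model)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```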