Tailored Design of Audio-Visual Speech Recognition Models using Branchformers

Recent advances in Audio-Visual Speech Recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of these systems in adverse, noisy environments. In most cases, the task has been addressed with models composed of two independent encoders, each dedicated to a specific modality. Although recent works have explored unified audio-visual encoders, determining the optimal cross-modal architecture remains an open challenge. Furthermore, such approaches often rely on models with a very large number of parameters and computationally expensive training processes. In this paper, we aim to bridge this research gap by introducing a novel audio-visual framework. To the best of our knowledge, our proposed method is the first attempt to harness the flexibility and interpretability offered by encoder architectures such as the Branchformer in the design of parameter-efficient AVSR systems. More precisely, the proposed framework consists of two steps: first, estimating audio-only and video-only systems, and then designing a tailored unified audio-visual encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks, covering multiple data conditions and scenarios, demonstrate the effectiveness of the proposed method. The results show that our tailored AVSR system reaches state-of-the-art recognition rates while significantly reducing model complexity compared with the prevalent approach in the field. Code and pre-trained models are available at https://github.com/david-gimeno/tailored-avsr.

