Local and global features are both essential for automatic speech recognition (ASR). Many recent methods have shown that simply combining local and global features can further improve ASR performance. However, these methods pay little attention to the interaction between local and global features, and their serial architectures are too rigid to reflect the relationships between the two. To address these issues, this paper proposes InterFormer, which interactively fuses local and global features to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. In addition, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and the fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of the proposed InterFormer and its superior performance over other Transformer and Conformer models.
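To make the parallel design concrete, the following is a minimal pure-Python sketch of one block in which a local branch and a global branch process the same input, exchange information, and are then fused. It is an illustration under stated assumptions, not the InterFormer implementation: the abstract does not specify the internals of BFIM or SFM, so the sigmoid cross-gating in `interact`, the per-element softmax weighting in `selective_fuse`, the moving-average stand-in for convolution, the projection-free attention, and the residual connection are all hypothetical choices, as are all function names.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_branch(x, kernel=3):
    """Stand-in for the convolution branch: per-channel moving average over time."""
    T, half = len(x), kernel // 2
    out = []
    for t in range(T):
        window = x[max(0, t - half): min(T, t + half + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out

def global_branch(x):
    """Stand-in for the transformer branch: single-head dot-product
    self-attention over all frames, without learned projections."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, x)) for c in range(d)])
    return out

def interact(local, glob):
    """Hypothetical bidirectional interaction: each branch is gated by the other."""
    new_local = [[l * sigmoid(g) for l, g in zip(lf, gf)] for lf, gf in zip(local, glob)]
    new_glob = [[g * sigmoid(l) for l, g in zip(lf, gf)] for lf, gf in zip(local, glob)]
    return new_local, new_glob

def selective_fuse(local, glob):
    """Hypothetical selective fusion: softmax weights over the two branches,
    computed per element, giving a convex combination of the two values."""
    fused = []
    for lf, gf in zip(local, glob):
        frame = []
        for l, g in zip(lf, gf):
            wl, wg = softmax([l, g])
            frame.append(wl * l + wg * g)
        fused.append(frame)
    return fused

def parallel_block(x):
    """x is a list of T frames, each a list of D floats."""
    local, glob = local_branch(x), global_branch(x)   # both branches see the same input
    local, glob = interact(local, glob)               # exchange information in both directions
    fused = selective_fuse(local, glob)               # weight and merge the two branches
    # Residual connection (an assumption of this sketch).
    return [[xi + fi for xi, fi in zip(xf, ff)] for xf, ff in zip(x, fused)]

frames = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.2, 0.2]]
output = parallel_block(frames)
```

The contrast with a serial design is that neither branch consumes the other's output as its input: both start from the same frames, and the only coupling between them is the explicit interaction and fusion steps.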