Local and global features are both essential for automatic speech recognition (ASR). Many recent methods have shown that simply combining local and global features can further improve ASR performance. However, these methods pay little attention to the interaction between local and global features, and their serial architectures are too rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer, which performs interactive fusion of local and global features to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. In addition, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and the fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of the proposed InterFormer and its superior performance over other Transformer and Conformer models.
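To make the parallel design concrete, the following is a minimal toy sketch in plain Python, not the InterFormer implementation. The abstract does not specify the internals of BFIM or SFM, so everything here is an assumption made for illustration: features are one scalar per frame, the convolution block is replaced by a moving average, the transformer block by scalar self-attention, the interaction by a sigmoid-gated exchange, and the fusion by per-frame softmax weights. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def local_branch(x, kernel=3):
    """Toy stand-in for the convolution block: moving average over nearby frames."""
    half = kernel // 2
    out = []
    for t in range(len(x)):
        window = x[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out


def global_branch(x):
    """Toy stand-in for the transformer block: scalar self-attention over all frames."""
    out = []
    for q in x:
        scores = [q * k for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / z)
    return out


def bidirectional_interaction(local, glob):
    """Assumed interaction: each branch adds a sigmoid-gated copy of the other."""
    new_local = [l + sigmoid(g) * g for l, g in zip(local, glob)]
    new_glob = [g + sigmoid(l) * l for l, g in zip(local, glob)]
    return new_local, new_glob


def selective_fusion(local, glob):
    """Assumed fusion: per-frame softmax weights over the two branch outputs."""
    fused = []
    for l, g in zip(local, glob):
        m = max(l, g)
        wl, wg = math.exp(l - m), math.exp(g - m)
        fused.append((wl * l + wg * g) / (wl + wg))
    return fused


def parallel_block(x):
    """Run both branches on the same input, let them interact, then fuse."""
    local = local_branch(x)
    glob = global_branch(x)
    local, glob = bidirectional_interaction(local, glob)
    return selective_fusion(local, glob)


if __name__ == "__main__":
    print(parallel_block([0.1, 0.5, -0.2, 0.3]))
```

The point of the sketch is structural: both branches read the same input rather than feeding one into the other, information flows in both directions before fusion, and the fused output is a per-frame weighted combination instead of a fixed sum or concatenation.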