Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems because they model local context efficiently. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components of the Conformer other than the convolution module have been reexamined, altering the convolution module itself has been far less explored. To this end, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with the Conformer and its variants across four different datasets and three different modeling paradigms and show up to 8% relative word error rate~(WER) improvements.
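To make the idea of combining several kernel sizes with gating concrete, the following is a minimal, dependency-free sketch. It is not the Multi-Convformer module itself: the abstract does not specify how the branches are fused or how the gate is computed, so the averaging of branches, the sigmoid-of-input gate, the fixed averaging kernels, and all function names (`conv1d_same`, `multi_kernel_gated_conv`) are assumptions chosen purely for illustration.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation that preserves the input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def multi_kernel_gated_conv(x, kernels):
    """Apply one convolution per kernel, average the branches, then gate.

    Hypothetical fusion scheme: the gate is simply sigmoid(x[t]); a trained
    model would learn both the kernels and the gate.
    """
    gate = [sigmoid(v) for v in x]
    branches = [conv1d_same(x, k) for k in kernels]
    return [
        gate[t] * sum(b[t] for b in branches) / len(branches)
        for t in range(len(x))
    ]


# Three branches with different receptive fields (kernel sizes 3, 5, 7).
kernels = [[1 / 3] * 3, [1 / 5] * 5, [1 / 7] * 7]
x = [0.0, 1.0, 0.0, -1.0, 2.0, 0.5, 0.0, 1.5]
y = multi_kernel_gated_conv(x, kernels)
```

Each branch sees a different span of neighboring frames, which is one simple way to capture local dependencies at several granularities; in a real system the kernels would be learned depthwise convolutions over many channels rather than fixed averages over a single sequence.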