Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition~(ASR) systems because they model local context efficiently. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While the components of the Conformer other than the convolution module have been reexamined, altering the convolution module itself has been far less explored. To this end, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This improves the modelling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance while being more parameter-efficient. We empirically compare our approach with the Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate~(WER) improvements.
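
To make the idea of multiple convolution kernels combined with gating concrete, here is a minimal pure-Python sketch. It is not the Multi-Convformer implementation: averaging the branches, gating with the sigmoid of the input, and the fixed box-filter weights produced by the hypothetical helper `box_kernels` are all assumptions made for illustration, and the function names are invented here. It only shows how depthwise convolutions with different kernel sizes can view the same frames at different granularities before being fused and gated.

```python
import math


def depthwise_conv1d(frames, kernels):
    """'Same'-padded depthwise 1D convolution over time.

    frames:  list of T frames, each a list of D channel values.
    kernels: list of D kernels (one per channel), all of the same odd length.
    """
    num_frames, num_channels = len(frames), len(frames[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for d in range(num_channels):
            acc = 0.0
            for j, weight in enumerate(kernels[d]):
                src = t + j - half
                if 0 <= src < num_frames:  # zero padding outside the sequence
                    acc += weight * frames[src][d]
            out[t][d] = acc
    return out


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def multi_kernel_gated_conv(frames, kernel_bank):
    """Average several depthwise-conv branches, then gate with sigmoid(input).

    Both the averaging and the gate are illustrative assumptions,
    not the fusion scheme of the paper.
    """
    branches = [depthwise_conv1d(frames, kernels) for kernels in kernel_bank]
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            fused = sum(branch[t][d] for branch in branches) / len(branches)
            row.append(fused * sigmoid(frames[t][d]))
        out.append(row)
    return out


def box_kernels(size, num_channels):
    """Hypothetical fixed averaging kernels, standing in for learned weights."""
    return [[1.0 / size] * size for _ in range(num_channels)]


# Four frames with two channels; kernel sizes 1 and 3 act as two granularities.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
kernel_bank = [box_kernels(size, 2) for size in (1, 3)]
output = multi_kernel_gated_conv(frames, kernel_bank)
```

In a trained model the kernel weights and the gate would be learned, and the branches could be fused in other ways (for example by concatenation followed by a projection); the sketch fixes these choices only to stay short and self-contained.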