The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure in which the middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm, called BiasNorm, that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. ScaledAdam achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
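The abstract names SwooshR, SwooshL, and BiasNorm without defining them. As a rough illustration, the sketch below follows the definitions given in the body of the Zipformer paper rather than anything stated in this abstract: SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035, and BiasNorm(x) = x / RMS[x - b] * exp(γ). The function names, the pure-Python list representation, and the small `eps` guard are assumptions made for this example, so it should be checked against the released code before being relied on.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x: float) -> float:
    """SwooshR as recalled from the paper; the offset makes swoosh_r(0) ~ 0."""
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    """SwooshL as recalled from the paper; shifted further right than SwooshR."""
    return softplus(x - 4.0) - 0.08 * x - 0.035


def rms(values: list[float]) -> float:
    """Root mean square of a vector."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def bias_norm(x: list[float], bias: list[float], log_scale: float,
              eps: float = 1e-12) -> list[float]:
    """BiasNorm sketch: divide x by RMS(x - bias), then scale by exp(log_scale).

    `bias` is the per-channel bias b and `log_scale` the scalar gamma.
    `eps` is an assumption of this sketch to avoid division by zero.
    """
    centered = [xi - bi for xi, bi in zip(x, bias)]
    denom = math.sqrt(rms(centered) ** 2 + eps)
    gain = math.exp(log_scale)
    return [xi / denom * gain for xi in x]


if __name__ == "__main__":
    for v in (-2.0, 0.0, 2.0):
        print(v, swoosh_r(v), swoosh_l(v))
    # With a nonzero bias, the output norm depends on the input norm.
    print(rms(bias_norm([3.0, 4.0], [1.0, 1.0], 0.0)))
    print(rms(bias_norm([6.0, 8.0], [1.0, 1.0], 0.0)))
```

With a zero bias the output RMS in this sketch is always exp(log_scale), so the input length is discarded. With a nonzero bias, scaling the input changes the output RMS, which is one way to read the abstract's remark that BiasNorm retains some length information.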