The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe Zipformer, a faster, more memory-efficient, and better-performing transformer. Modeling changes include: 1) a U-Net-like encoder structure in which the middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we reuse attention weights for efficiency; 3) BiasNorm, a modified form of LayerNorm that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. ScaledAdam achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
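Two of the ideas summarized above can be illustrated with a short sketch. The abstract does not define SwooshR or SwooshL; the formulas below assume the definitions given in the body of the Zipformer paper, and the constants should be checked against the paper or the icefall repository before use. The second part illustrates only the scaling idea stated in the abstract (multiplying the update by the tensor's current scale, taken here to be its root-mean-square value); it is not the ScaledAdam implementation, it omits the learned parameter scale, and `scaled_step`, `relative_change`, and the `direction` argument are hypothetical names introduced for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swoosh_r(x: float) -> float:
    # Assumed definition: log(1 + exp(x - 1)) - 0.08 * x - 0.313261687.
    # The offset makes the function approximately zero at x = 0.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: float) -> float:
    # Assumed definition: log(1 + exp(x - 4)) - 0.08 * x - 0.035.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def rms(values: list[float]) -> float:
    """Root-mean-square value, used here as the tensor's 'scale'."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def scaled_step(param: list[float], direction: list[float],
                lr: float, eps: float = 1e-8) -> list[float]:
    # Hypothetical helper: 'direction' stands in for an Adam-style
    # normalized update. Multiplying by the parameter's RMS makes the
    # step size proportional to the tensor's current scale.
    scale = max(rms(param), eps)
    return [p - lr * scale * d for p, d in zip(param, direction)]


def relative_change(old: list[float], new: list[float]) -> float:
    """RMS of the update divided by RMS of the original tensor."""
    diff = [n - o for o, n in zip(old, new)]
    return rms(diff) / rms(old)


if __name__ == "__main__":
    direction = [1.0, -1.0, 1.0]
    small = [0.01, -0.02, 0.03]
    large = [100.0 * v for v in small]
    for name, param in (("small", small), ("large", large)):
        updated = scaled_step(param, direction, lr=0.1)
        print(name, relative_change(param, updated))
    print(swoosh_r(0.0), swoosh_l(0.0))
```

In this sketch the two tensors differ in magnitude by a factor of 100, yet the relative change produced by one step is the same for both, which is the property the abstract attributes to scaling the update by each tensor's current scale.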