The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure in which the middle stacks operate at lower frame rates; 2) a reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) BiasNorm, a modified form of LayerNorm that allows us to retain some length information; 4) new activation functions, SwooshR and SwooshL, that work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor's current scale to keep the relative change about the same, and also explicitly learns the parameter scale. ScaledAdam achieves faster convergence and better performance than Adam. Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
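The scaling idea behind ScaledAdam can be illustrated with a minimal sketch. It is not the implementation from the icefall repository: the function names, the hyperparameter values (`lr`, `beta1`, `beta2`, `eps`, `min_rms`), and the use of the root-mean-square (RMS) as the measure of a tensor's "current scale" are assumptions made for illustration, and the explicit learning of the parameter scale is omitted. The sketch only shows that multiplying an Adam-style update by the tensor's RMS gives tensors of very different magnitudes about the same relative change.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def new_state(n):
    """Fresh optimizer state for a tensor with n elements (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def scaled_adam_step(param, grad, state, lr=0.05, beta1=0.9, beta2=0.98,
                     eps=1e-8, min_rms=1e-5):
    """One simplified step: an Adam-style direction multiplied by the tensor's RMS.

    Assumption: the tensor's "scale" is its RMS, floored at min_rms.
    Learning the scale itself is left out of this sketch.
    """
    state["t"] += 1
    t = state["t"]
    scale = max(rms(param), min_rms)
    new_param = []
    for i, (p, g) in enumerate(zip(param, grad)):
        # Standard Adam moment estimates with bias correction.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        direction = m_hat / (math.sqrt(v_hat) + eps)
        # The update is proportional to the tensor's own scale.
        new_param.append(p - lr * scale * direction)
    return new_param


def relative_change(old, new):
    """RMS of the update divided by the RMS of the original tensor."""
    return rms([n - o for o, n in zip(old, new)]) / rms(old)


# Two tensors that differ only by a factor of 100 receive the same gradient.
small = [0.01, -0.02, 0.03]
large = [v * 100 for v in small]
grad = [1.0, -2.0, 0.5]

small_next = scaled_adam_step(small, grad, new_state(3))
large_next = scaled_adam_step(large, grad, new_state(3))

rel_small = relative_change(small, small_next)
rel_large = relative_change(large, large_next)
print(rel_small, rel_large)
```

With plain Adam the absolute step size would be roughly the same for both tensors, so the small tensor would change far more in relative terms. In this sketch both relative changes come out close to `lr`.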