State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks, rivalling and in some cases outperforming many attention-based approaches. In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms, in which parallel heads are trained to learn local and global temporal dynamics on sequence data. As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus. Furthermore, we augment the transformer block with MH-SSM layers, yielding an architecture we refer to as the Stateformer, which achieves state-of-the-art performance on the LibriSpeech task, with word error rates of 1.76\%/4.37\% on the development sets and 1.91\%/4.36\% on the test sets without using an external language model.
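To make the idea of heads with local and global temporal dynamics concrete, the following is a minimal sketch of a discrete-time linear state space recurrence run with several heads in parallel. It is not the MH-SSM layer proposed here: the scalar state per head, the decay values, the averaging of heads, and the sigmoid input gate are all assumptions chosen for illustration, and the function names `ssm_head` and `multi_head_ssm` are hypothetical.

```python
import math


def ssm_head(u, a, b, c):
    """Run one scalar-state linear SSM over the input sequence u.

    Recurrence: x[k] = a * x[k-1] + b * u[k],  output: y[k] = c * x[k].
    A decay a near 0 forgets quickly (local context); a near 1 retains
    information over many steps (global context).
    """
    x = 0.0
    y = []
    for u_k in u:
        x = a * x + b * u_k
        y.append(c * x)
    return y


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multi_head_ssm(u, decays, gate_weight=1.0):
    """Average several heads, then apply an input-dependent gate.

    The gate (sigmoid of the scaled input) is a hypothetical stand-in
    for a gating mechanism, not the one used in the paper.
    """
    heads = [ssm_head(u, a, 1.0 - a, 1.0) for a in decays]
    mixed = [sum(h[k] for h in heads) / len(heads) for k in range(len(u))]
    return [sigmoid(gate_weight * u_k) * m for u_k, m in zip(u, mixed)]


# Impulse responses show how the decay sets each head's time scale.
impulse = [1.0] + [0.0] * 7
local_head = ssm_head(impulse, a=0.1, b=0.9, c=1.0)
global_head = ssm_head(impulse, a=0.9, b=0.1, c=1.0)
gated = multi_head_ssm(impulse, decays=[0.1, 0.9])
```

In this sketch the head with the small decay responds strongly to the current input and fades within a few steps, while the head with the large decay responds weakly at first but still carries the impulse several steps later. Combining such heads gives a single layer access to both short and long time scales.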