Online speech recognition, in which the model can access only left context, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR with structured state-space sequence models (S4), a family of models that provides a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 that uses real-valued recurrent weights with a local convolution, allowing the two to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from LibriSpeech, outperforming Conformers with extensively tuned convolution.
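To make the idea of stacking a real-valued recurrence with a local convolution concrete, the following is a minimal NumPy sketch and not the implementation evaluated in this work. It assumes a simplified diagonal, real-valued linear recurrence in place of a full S4 layer (no HiPPO initialization or discretization), applies the recurrence before the convolution (the ordering is an assumption), and uses hypothetical names and shapes (`real_diagonal_ssm`, `causal_depthwise_conv`, `stacked_block`, state size `N`, kernel width `K`). Both components are causal, so the output at frame `t` depends only on frames up to `t`.

```python
import numpy as np


def real_diagonal_ssm(x, a, b, c):
    """Causal linear recurrence with real-valued weights (simplified stand-in for S4).

    h[t] = a * h[t-1] + b * x[t];  y[t] = sum over state dim of c * h[t].
    x: (T, D) input frames; a, b, c: (D, N) real parameters (hypothetical shapes).
    """
    T, D = x.shape
    h = np.zeros_like(a)
    y = np.empty((T, D))
    for t in range(T):
        # Broadcast the D input channels across the N state dimensions.
        h = a * h + b * x[t][:, None]
        y[t] = (c * h).sum(axis=1)
    return y


def causal_depthwise_conv(x, w):
    """Local causal convolution: y[t] = sum_k w[k] * x[t-k], with zero left padding.

    x: (T, D) input frames; w: (K, D) per-channel kernel.
    """
    K = w.shape[0]
    T, D = x.shape
    padded = np.concatenate([np.zeros((K - 1, D)), x], axis=0)
    y = np.zeros((T, D))
    for k in range(K):
        start = K - 1 - k  # padded[start + t] == x[t - k]
        y += w[k] * padded[start:start + T]
    return y


def stacked_block(x, a, b, c, w):
    """Recurrence for long left context, then a local convolution (assumed order)."""
    return causal_depthwise_conv(real_diagonal_ssm(x, a, b, c), w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, N, K = 50, 8, 4, 3
    x = rng.standard_normal((T, D))
    a = rng.uniform(0.1, 0.9, (D, N))  # |a| < 1 keeps the recurrence stable
    b = rng.standard_normal((D, N))
    c = rng.standard_normal((D, N))
    w = rng.standard_normal((K, D))
    print(stacked_block(x, a, b, c, w).shape)
```

In this sketch the recurrence carries information from arbitrarily far in the past through a fixed-size state, while the convolution mixes only the last `K` frames, which is one way to read the complementary roles described above.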