Online speech recognition, where the model can access only the left context, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR with structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design stacks a small S4 that uses real-valued recurrent weights with a local convolution, so that the two components complement each other. Our best model achieves WERs of 4.01%/8.53% on the LibriSpeech test sets, outperforming Conformers with extensively tuned convolution.
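To make the stacked design concrete, the following is a minimal NumPy sketch of a causal state-space recurrence with real-valued diagonal weights, followed by a causal local (depthwise) convolution. It is an illustration under stated assumptions, not the authors' implementation: the function names (`real_diag_ssm`, `causal_depthwise_conv`, `stacked_block`), the diagonal parameterization, the residual connections, and the ordering of the two components are all hypothetical, and S4's structured initialization and discretization are omitted.

```python
import numpy as np

def real_diag_ssm(u, a, b, c):
    """Causal diagonal state-space recurrence with real-valued weights.

    u: (T, D) input frames; a, b, c: (D, N) per-channel parameters.
    Assumes |a| < 1 so the state stays bounded (an illustrative choice).
    """
    T, D = u.shape
    x = np.zeros_like(a)                 # hidden state, shape (D, N)
    y = np.empty((T, D))
    for t in range(T):
        # The state summarizes all frames up to and including t.
        x = a * x + b * u[t][:, None]
        y[t] = (c * x).sum(axis=1)
    return y

def causal_depthwise_conv(u, w):
    """Causal depthwise convolution.

    u: (T, D) input; w: (K, D) kernel, where w[k] weights frame t - k.
    """
    T, D = u.shape
    K = w.shape[0]
    # Left-pad with zeros so that no future frame is ever read.
    padded = np.concatenate([np.zeros((K - 1, D)), u], axis=0)
    y = np.zeros((T, D))
    for k in range(K):
        start = K - 1 - k                # padded[start + t] == u[t - k]
        y += w[k] * padded[start:start + T]
    return y

def stacked_block(u, a, b, c, w):
    """Hypothetical stack: long-range SSM first, then local convolution."""
    h = u + real_diag_ssm(u, a, b, c)    # residual around the SSM (assumption)
    return h + causal_depthwise_conv(h, w)

# Usage with arbitrary illustrative sizes.
rng = np.random.default_rng(0)
T, D, N, K = 20, 4, 2, 3
u = rng.standard_normal((T, D))
a = rng.uniform(0.1, 0.9, size=(D, N))
b = rng.standard_normal((D, N))
c = rng.standard_normal((D, N))
w = rng.standard_normal((K, D))
out = stacked_block(u, a, b, c, w)       # shape (T, D)
```

In this sketch the recurrence carries information from arbitrarily far in the past through a fixed-size state, while the convolution only sees the last `K` frames, which is one way to read the complementary roles described in the abstract. Both components read only the current and earlier frames, so the block is usable in the online setting.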