This paper presents an in-depth study of the Sequentially Sampled Chunk Conformer (SSC-Conformer) for streaming end-to-end (E2E) ASR. We first demonstrate the significant performance gains obtained by using sequentially sampled chunk-wise multi-head self-attention (SSC-MHSA) in the Conformer encoder, which allows efficient cross-chunk interactions while keeping linear complexity. We then explore chunked convolution, which exploits the chunk-wise future context, and integrate it with causal convolution in the convolution layers to further reduce the character error rate (CER). We verify the proposed SSC-Conformer on the AISHELL-1 benchmark, where experimental results show that it achieves state-of-the-art performance for streaming E2E ASR, with a CER of 5.33% without LM rescoring. Moreover, owing to its linear complexity, the SSC-Conformer can be trained with large batch sizes and performs inference more efficiently.
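To make the abstract's two structural ideas concrete, the following is a minimal illustrative sketch, not the paper's implementation. It counts the query-key pairs scored by full versus chunk-restricted self-attention, which shows why chunking yields cost linear in sequence length, and it lists which frames a causal versus a chunk-confined convolution can see. All function names, the non-overlapping chunk layout, and the choice to confine the chunked convolution's left context to the current chunk are assumptions made for illustration; the sequential sampling step of SSC-MHSA is not modeled here.

```python
from typing import List


def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored when every frame attends to every frame."""
    return num_frames * num_frames


def chunked_attention_pairs(num_frames: int, chunk_size: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping chunks.

    Assumption: the last chunk may be shorter than chunk_size.
    """
    full_chunks, remainder = divmod(num_frames, chunk_size)
    return full_chunks * chunk_size * chunk_size + remainder * remainder


def causal_conv_context(t: int, kernel_size: int) -> List[int]:
    """Frame indices a causal convolution sees at frame t (no future frames)."""
    return [i for i in range(t - kernel_size + 1, t + 1) if i >= 0]


def chunked_conv_context(t: int, kernel_size: int, chunk_size: int) -> List[int]:
    """Frame indices a symmetric convolution sees at frame t when it is
    confined to the chunk containing t (assumed layout, for illustration)."""
    half = kernel_size // 2
    start = (t // chunk_size) * chunk_size
    end = start + chunk_size  # exclusive upper bound of the chunk
    return [i for i in range(t - half, t + half + 1) if start <= i < end]


if __name__ == "__main__":
    for frames in (1000, 2000):
        print(frames, full_attention_pairs(frames), chunked_attention_pairs(frames, 16))
    print(causal_conv_context(5, 3))      # past and current frames only
    print(chunked_conv_context(5, 3, 4))  # includes a future frame inside the chunk
```

With a fixed chunk size, doubling the number of frames roughly doubles the chunked count while quadrupling the full-attention count. In the convolution helpers, the chunk-confined kernel can reach future frames inside its chunk, whereas the causal kernel cannot, which is the kind of complementary context the abstract describes combining.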