Self-attention-based transformers and conformers have recently been introduced as alternatives to RNNs for ASR acoustic modeling. However, full-sequence attention is non-streamable and computationally expensive, so it requires modifications such as chunking and caching for efficient streaming ASR. In this paper, we propose applying RWKV, a linear-attention transformer variant, to streaming ASR. RWKV combines the superior performance of transformers with the inference efficiency of RNNs, which makes it well suited to streaming ASR scenarios where latency and memory budgets are restricted. Experiments at varying scales (100 h to 10,000 h) demonstrate that RWKV-Transducer and RWKV-Boundary-Aware-Transducer achieve accuracy comparable to or even better than that of the chunk conformer transducer, with minimal latency and inference memory cost.
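To illustrate why an RWKV-style layer can run frame by frame with constant memory, the following is a minimal sketch of a simplified single-channel WKV-style recurrence. It is not the implementation used in this work: the function names `wkv_streaming` and `wkv_full_sequence`, the scalar decay `w`, and the current-frame bonus `u` are illustrative assumptions, and the sketch omits the numerical-stability handling a practical implementation would need. It shows that carrying two running sums gives the same outputs as recomputing a decayed weighted average over the whole history at every frame.

```python
import math


def wkv_streaming(keys, values, w, u):
    """Process one frame at a time, keeping only two running sums as state.

    Hypothetical scalar (single-channel) simplification for illustration.
    """
    num, den = 0.0, 0.0  # decayed sums over all past frames
    outputs = []
    for k, v in zip(keys, values):
        cur = math.exp(u + k)  # weight of the current frame, with bonus u
        outputs.append((num + cur * v) / (den + cur))
        # Decay the past by exp(-w), then fold the current frame into the state.
        num = math.exp(-w) * num + math.exp(k) * v
        den = math.exp(-w) * den + math.exp(k)
    return outputs


def wkv_full_sequence(keys, values, w, u):
    """Reference: recompute the weighted average over the full history per frame."""
    outputs = []
    for t in range(len(keys)):
        num = math.exp(u + keys[t]) * values[t]
        den = math.exp(u + keys[t])
        for i in range(t):
            # Past frame i is decayed by its distance from frame t.
            weight = math.exp(-(t - 1 - i) * w + keys[i])
            num += weight * values[i]
            den += weight
        outputs.append(num / den)
    return outputs


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.2]
    values = [1.0, 2.0, -1.0, 0.5]
    print(wkv_streaming(keys, values, w=0.5, u=0.3))
    print(wkv_full_sequence(keys, values, w=0.5, u=0.3))
```

The streaming version does a fixed amount of work per frame and stores two numbers per channel, whereas the reference version revisits every earlier frame, so its per-frame cost grows with the length of the utterance.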