Millimeter-wave (mmWave) speech recognition opens up new possibilities for audio-related applications such as conference speech transcription and eavesdropping. For practical use in real scenarios, however, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary exceeding 13,000 words. Radio2Text is built on a tailored streaming Transformer that effectively learns representations of speech-related features, paving the way for large-vocabulary streaming ASR. Because streaming networks cannot access the entire future input, we propose Guidance Initialization, which transfers feature knowledge about the global context from a non-streaming Transformer to the tailored streaming Transformer through weight inheritance. We further propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low-quality mmWave signals on recognition performance. In cross-modal KD, an audio streaming Transformer provides feature and response guidance that carries rich and accurate speech information to supervise the training of the tailored radio streaming Transformer. Experimental results show that Radio2Text achieves a character error rate of 5.7% and a word error rate of 9.4% when recognizing a vocabulary of over 13,000 words.
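To make the cross-modal KD objective concrete, the following is a minimal sketch of how feature guidance and response guidance from an audio teacher could be combined with the recognition loss of a radio student. The abstract does not specify the loss forms, so the mean squared error on features, the temperature-scaled KL divergence on output distributions, the weights `alpha` and `beta`, and all function names here are illustrative assumptions, not the formulation used by Radio2Text.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_loss(student_feat, teacher_feat):
    """Assumed feature guidance: mean squared error between feature vectors."""
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def response_loss(student_logits, teacher_logits, temperature=2.0):
    """Assumed response guidance: KL(teacher || student) on softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The T^2 factor is the usual rescaling used with softened distributions.
    return kl * temperature ** 2


def cross_modal_kd_loss(task_loss, student_feat, teacher_feat,
                        student_logits, teacher_logits,
                        alpha=0.5, beta=0.5, temperature=2.0):
    """Hypothetical total loss: recognition loss plus the two guidance terms."""
    return (task_loss
            + alpha * feature_loss(student_feat, teacher_feat)
            + beta * response_loss(student_logits, teacher_logits, temperature))
```

In this sketch both guidance terms are zero when the radio student reproduces the audio teacher's features and outputs exactly, so the total reduces to the recognition loss alone; any mismatch adds a positive penalty.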
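For reference, the character error rate (CER) and word error rate (WER) reported above are conventionally computed as the Levenshtein edit distance between the reference and the recognized text, divided by the reference length in characters or words. The sketch below follows that standard definition; the function names are illustrative, and it makes no claim about the text normalization or evaluation tooling used for the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

As a usage example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6.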