Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation for Whisper-like models. Whisper-Streaming uses a local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on an unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in a live transcription service at a multilingual conference.
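
The abstract names a local agreement policy without detailing it. As a rough illustration only, the sketch below assumes the common two-hypothesis variant: each time the model re-transcribes the growing audio buffer, the words on which the current and previous hypotheses agree (their longest common prefix) are committed, and the rest remain tentative. The class and function names are hypothetical and are not taken from the Whisper-Streaming code; hypotheses are assumed to arrive as lists of words.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest sequence of leading words shared by a and b."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement:
    """Hypothetical sketch: commit words confirmed by two consecutive hypotheses."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already emitted as final
        self.previous: List[str] = []   # hypothesis from the previous update

    def update(self, hypothesis: List[str]) -> List[str]:
        """Take the latest full-buffer hypothesis, return newly confirmed words."""
        agreed = common_prefix(self.previous, hypothesis)
        # Only words beyond what was already committed are new.
        # Assumption: committed words are never retracted in this sketch.
        newly_confirmed = agreed[len(self.committed):]
        self.committed.extend(newly_confirmed)
        self.previous = hypothesis
        return newly_confirmed


policy = LocalAgreement()
for hyp in (["the", "cat"], ["the", "cat", "sat"], ["the", "cat", "sat", "on"]):
    print(policy.update(hyp))
```

In this sketch a word is emitted only once a second update confirms it, so latency depends on how often the buffer is re-transcribed. The sketch omits audio buffering, buffer trimming, and the self-adaptive latency mentioned in the abstract.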