Real-time automatic speech recognition (ASR) systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while maintaining low latency and high accuracy remains a major challenge. In this work, we present SWIM, a novel real-time ASR system built on top of OpenAI's Whisper model that enables true model-level parallelization for scalable, multilingual transcription. SWIM supports multiple concurrent audio streams without modifying the underlying model, and it introduces a buffer merging strategy that maintains transcription fidelity while ensuring efficient resource usage. We evaluate SWIM in multi-client settings, scaling up to 20 concurrent users, and show that it delivers accurate real-time transcriptions in English, Italian, and Spanish while maintaining low latency and high throughput. Whereas Whisper-Streaming achieves a word error rate of approximately 8.2% with an average delay of approximately 3.4 s in a single-client, English-only setting, SWIM extends this capability to multilingual, multi-client environments. It maintains comparable accuracy with significantly lower delay, around 2.4 s with 5 clients, and continues to scale effectively up to 20 concurrent clients without degrading transcription quality while increasing overall throughput. Our approach advances scalable ASR by improving robustness and efficiency in dynamic, multi-user environments.