WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models such as Whisper. Existing streaming approaches either sacrifice accuracy through aggressive chunking or incur prohibitive memory costs through unbounded context accumulation. We present WhisperPipe, a novel streaming architecture that achieves bounded memory consumption while maintaining transcription quality through three key innovations: a hybrid Voice Activity Detection (VAD) pipeline that combines Silero VAD with energy-based filtering to reduce false activations by 34%; a dynamic buffering mechanism with overlapping context windows that prevents information loss at segment boundaries; and an adaptive processing strategy that balances latency and accuracy based on speech characteristics. Evaluated on 2.5 hours of diverse audio data, WhisperPipe achieves a median end-to-end latency of 89 ms (90th percentile: 142 ms) while consuming 48% less peak GPU memory and showing 80.9% lower average GPU utilization than baseline Whisper implementations. Memory usage remains stable over extended sessions, with zero growth across 150 minutes of continuous operation. Comparative analysis against related work shows that WhisperPipe achieves competitive accuracy (WER within 2% of offline Whisper) while operating at 3-5x lower latency than existing streaming solutions. The architecture's modular design enables deployment across environments ranging from resource-constrained edge devices to cloud infrastructure. Our results demonstrate that careful architectural design can reconcile the competing demands of real-time responsiveness and model sophistication in production ASR systems.
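The two mechanisms named above, energy-gated VAD and bounded buffering with overlapping windows, can be illustrated with a minimal sketch. This is not the WhisperPipe implementation: the names `rms_energy`, `is_speech`, and `OverlapBuffer`, the thresholds, and the rule for combining the model probability with the energy check are all assumptions made for illustration. The speech probability is passed in as a plain number standing in for the output of a VAD model such as Silero VAD.

```python
import math
from collections import deque


def rms_energy(frame):
    """Root-mean-square energy of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def is_speech(frame, model_prob, prob_threshold=0.5, energy_floor=0.01):
    """Hypothetical hybrid gate: accept a frame only if the VAD model's
    speech probability AND the frame energy both clear their thresholds.
    The AND rule and both default thresholds are assumptions."""
    return model_prob >= prob_threshold and rms_energy(frame) >= energy_floor


class OverlapBuffer:
    """Fixed-capacity sample buffer that emits windows sharing `overlap`
    samples with the previous window. Memory is bounded by `window`."""

    def __init__(self, window, overlap):
        if not 0 <= overlap < window:
            raise ValueError("overlap must satisfy 0 <= overlap < window")
        self.window = window
        self.hop = window - overlap          # new samples needed per window
        self.samples = deque(maxlen=window)  # old samples are dropped automatically
        self.fresh = 0                       # samples received since last emit

    def push(self, frame):
        """Add samples; return a list of any complete windows produced."""
        windows = []
        for s in frame:
            self.samples.append(s)
            self.fresh += 1
            if len(self.samples) == self.window and self.fresh >= self.hop:
                windows.append(list(self.samples))
                self.fresh = 0
        return windows


if __name__ == "__main__":
    buf = OverlapBuffer(window=4, overlap=2)
    for w in buf.push([1, 2, 3, 4, 5, 6, 7, 8]):
        print(w)  # consecutive windows share their last/first two samples
```

Because the `deque` has a fixed `maxlen`, the buffer cannot grow however long the stream runs, which is the property the abstract describes as bounded memory. The overlap means each segment boundary is covered by two consecutive windows.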
