Speech foundation models, exemplified by OpenAI's Whisper, have emerged as leaders in speech understanding thanks to their exceptional accuracy and adaptability. However, they are largely applied to pre-recorded audio; efficient handling of streaming speech remains in its infancy. Several core challenges underlie this limitation: (1) These models are trained on long, fixed-length audio inputs (typically 30 seconds). (2) Encoding such inputs involves processing up to 1,500 tokens through numerous transformer layers. (3) Generating outputs requires an irregular and computationally heavy beam search. Consequently, streaming speech processing on resource-constrained edge devices is more demanding than many other AI tasks, including text generation. To address these challenges, we introduce Whisper-T, an innovative framework combining model- and system-level optimizations: (1) Hush words, short learnable audio segments appended to inputs, prevent over-processing and reduce hallucinations in the model. (2) Beam pruning aligns streaming audio buffers over time, leveraging intermediate decoding results to significantly speed up the process. (3) CPU/GPU pipelining dynamically distributes resources between the encoding and decoding stages, adapting to variations in audio input, model characteristics, and hardware. We evaluate Whisper-T on ARM-based platforms with 4-12 CPU cores and 10-30 GPU cores, demonstrating latency reductions of 1.6x-4.7x and per-word delays as low as 0.5 seconds with minimal accuracy loss. Additionally, on a MacBook Air, Whisper-T maintains approximately 1-second latency per word while consuming just 7 watts of total system power.
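The CPU/GPU pipelining idea above can be illustrated with a minimal sketch: while one stage encodes audio chunk i, the other stage decodes the features of chunk i-1, so the two stages overlap instead of running serially. Everything here is hypothetical scaffolding — `encode`, `decode`, and the chunk format are stand-ins, not Whisper-T's actual API, and real work would run on GPU/CPU rather than in `time.sleep` placeholders.

```python
import queue
import threading
import time

def encode(chunk):
    # Placeholder for the GPU-side encoder stage.
    time.sleep(0.01)
    return f"feat({chunk})"

def decode(features):
    # Placeholder for the CPU-side beam-search decoding stage.
    time.sleep(0.01)
    return f"text({features})"

def pipeline(chunks):
    # A small bounded queue hands features from encoder to decoder,
    # letting encoding of chunk i overlap decoding of chunk i-1.
    q = queue.Queue(maxsize=2)
    out = []

    def encoder():
        for c in chunks:
            q.put(encode(c))
        q.put(None)  # sentinel: no more chunks

    t = threading.Thread(target=encoder)
    t.start()
    while True:
        feats = q.get()
        if feats is None:
            break
        out.append(decode(feats))
    t.join()
    return out

print(pipeline(["chunk0", "chunk1", "chunk2"]))
```

The bounded queue is the key design choice: it caps how far encoding can run ahead of decoding, bounding both memory use and end-to-end latency — the paper's actual scheduler additionally rebalances CPU/GPU resources dynamically, which this sketch does not attempt.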