Large speech recognition models such as Whisper-small achieve high accuracy but are difficult to deploy on edge devices because of their high computational demand. To address this, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the effects of quantization scheme, method, granularity, and bit-width. Our study covers four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate (WER). Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
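To make the roles of granularity and bit-width concrete, the following is a minimal sketch of symmetric (absmax) weight quantization in plain Python. It is an illustration only and is not the procedure used by any of the four libraries evaluated here; the function names `quantize_symmetric`, `roundtrip`, and `row_errors`, as well as the toy weight matrix, are hypothetical and chosen purely for this example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_symmetric(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Quantize floats to signed integers with one absmax scale (illustrative)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 3 for 3 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def roundtrip(values: List[float], bits: int = 8) -> List[float]:
    """Quantize and immediately dequantize, returning the reconstructed floats."""
    q, scale = quantize_symmetric(values, bits)
    return [x * scale for x in q]


def row_errors(weights: Matrix, bits: int, per_row: bool) -> List[float]:
    """Maximum absolute reconstruction error for each row of a rectangular matrix.

    per_row=False shares one scale across the whole matrix (per-tensor);
    per_row=True gives every row its own scale (a finer granularity).
    """
    if per_row:
        restored = [roundtrip(row, bits) for row in weights]
    else:
        flat = [v for row in weights for v in row]
        restored_flat = roundtrip(flat, bits)
        n = len(weights[0])
        restored = [restored_flat[i * n:(i + 1) * n] for i in range(len(weights))]
    return [
        max(abs(a - b) for a, b in zip(row, rec))
        for row, rec in zip(weights, restored)
    ]


if __name__ == "__main__":
    # Toy matrix whose rows differ in magnitude by two orders.
    toy = [[0.01, -0.02, 0.013], [1.0, -2.0, 1.5]]
    print("8-bit per-tensor:", row_errors(toy, 8, per_row=False))
    print("8-bit per-row:   ", row_errors(toy, 8, per_row=True))
    print("3-bit per-tensor:", row_errors(toy, 3, per_row=False))
```

In this toy setting, a single shared scale is dominated by the large-magnitude row, so the small-magnitude row is reconstructed less accurately than when each row has its own scale, and lowering the bit-width coarsens the grid for every row. Real PTQ libraries add further choices (asymmetric zero-points, group-wise scales, non-uniform formats such as nf4, activation quantization) that this sketch does not attempt to model.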