Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment

Recent transformer-based ASR models have achieved word error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources and contribute to significant carbon footprints. The traditional server-based architecture of ASR also raises privacy concerns, alongside reliability and latency issues caused by network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by balancing energy use and accuracy for specific applications. This study examines how quantization, memory demands, and energy consumption affect inference performance for various ASR models on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantees better resilience to noise nor predicts the energy consumption for a given transcription load. These findings, along with several others, offer novel insights for optimizing ASR systems in energy- and memory-limited environments, which is crucial for developing efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open-sourced and available at [https://github.com/zzadiues3338/ASR-energy-jetson].
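
For readers unfamiliar with the two quantities compared throughout, the following is a minimal sketch of how WER and energy can be computed in principle. It is not taken from the linked repository: the function names `word_error_rate` and `energy_joules` are hypothetical, the text normalization (lowercasing and whitespace splitting) is an assumption, and the energy estimate assumes timestamped power samples are already available from some measurement source.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are compared after lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for k in range(len(timestamps_s) - 1):
        dt = timestamps_s[k + 1] - timestamps_s[k]
        total += 0.5 * (power_w[k] + power_w[k + 1]) * dt
    return total
```

Under these definitions, a hypothesis that drops one word from a six-word reference has a WER of 1/6, and a constant 2 W draw over 2 s corresponds to 4 J.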
