As today's devices gain more powerful compute capabilities and resources, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, implementing on-device ASR remains challenging on resource-constrained devices such as smartphones, smart wearables, and other small home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve speech recognition more than 5.26 times faster than real time (a real-time factor, RTF, of 0.19) on small wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization in any Lp-norm using any floating-point precision.
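The relation between the two speed figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of RTF as processing time divided by audio duration; the function names and the 1.9 s / 10 s timings are hypothetical and chosen only to reproduce an RTF of 0.19.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_realtime(rtf: float) -> float:
    """Speed relative to real time is the reciprocal of the RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Hypothetical timings: 10 s of audio decoded in 1.9 s gives an RTF of about 0.19.
rtf = real_time_factor(1.9, 10.0)
speedup = speedup_over_realtime(0.19)
print(f"RTF = {rtf:.2f}, speedup = {speedup:.3f}x")
```

An RTF below 1 means the recognizer keeps up with incoming audio, which is the requirement for streaming use.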
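To illustrate why layer normalization may need a pre-normalizer in low precision, the following NumPy sketch compares a plain float16 layer norm with one that first divides the input by its largest absolute value. This max-abs (L-infinity) scaling is chosen here only for illustration and is not claimed to be the optimal pre-normalizer derived in the paper; the function names, the test vector, and the `eps` value are all assumptions.

```python
import numpy as np


def layer_norm_naive(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Layer norm (no affine parameters) with all arithmetic kept in x's dtype."""
    mean = x.mean(dtype=x.dtype)
    centered = x - mean
    # In float16, squaring values above roughly 256 exceeds the maximum of 65504.
    var = (centered * centered).mean(dtype=x.dtype)
    return centered / np.sqrt(var + x.dtype.type(eps))


def layer_norm_prenormalized(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Divide by max|x| first so every squared term is at most 1 (illustrative choice)."""
    scale = np.max(np.abs(x))
    if scale == 0:
        scale = x.dtype.type(1)  # all-zero input: nothing to rescale
    # Layer norm is scale-invariant up to eps, so the result is unchanged in exact
    # arithmetic; note that eps now acts relative to the rescaled variance.
    return layer_norm_naive(x / scale, eps)


x = np.array([1000, 2000, 3000, 4000], dtype=np.float16)

with np.errstate(over="ignore"):  # silence the expected overflow warning
    naive = layer_norm_naive(x)
stable = layer_norm_prenormalized(x)
reference = layer_norm_naive(x.astype(np.float64))

print("naive float16:         ", naive)
print("pre-normalized float16:", stable)
print("float64 reference:     ", reference)
```

In the naive version the variance overflows to infinity, so the normalized output collapses to zeros, whereas the pre-normalized version stays close to the float64 reference.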