With the increasingly powerful compute capabilities and resources of today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud onto devices to better protect user privacy. However, implementing on-device ASR remains challenging on resource-constrained devices such as smartphones, smart wearables, and other smart home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations that fit an advanced Conformer-based end-to-end streaming ASR system onto resource-constrained devices without accuracy degradation. We achieve speech recognition over 5.26 times faster than real time (0.19 RTF) on smart wearables while minimizing energy consumption and maintaining state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based, server-free AI applications. In addition, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization in any Lp norm at any floating-point precision.
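To illustrate the kind of numerical instability a pre-normalizer addresses, the sketch below contrasts a naive half-precision (L2-norm) layer normalization, whose intermediate squares overflow for large activations, with the same computation after dividing by a power-of-two pre-normalizer near the input's maximum magnitude. This is a minimal NumPy illustration of the general idea, not the paper's actual algorithm; the function names and the specific power-of-two choice are assumptions for the example. Because layer normalization is scale-invariant, the pre-normalizer leaves the mathematical result unchanged while keeping intermediates in range, and a power of two is exact in binary floating point, so it adds no rounding error of its own.

```python
import numpy as np

def layer_norm_fp16(x, eps=np.float16(1e-3)):
    # Naive fp16 layer norm: squaring (x - mu) overflows fp16
    # (max ~65504) once |x - mu| exceeds roughly 255.
    x = x.astype(np.float16)
    mu = x.mean(dtype=np.float16)
    var = np.mean((x - mu) ** 2, dtype=np.float16)  # -> inf for large x
    return (x - mu) / np.sqrt(var + eps)

def layer_norm_prenorm_fp16(x, eps=np.float16(1e-3)):
    # Same computation after a power-of-two pre-normalizer
    # (hypothetical sketch, not the paper's optimal construction).
    x = x.astype(np.float16)
    m = float(np.max(np.abs(x)))
    # Largest power of two not exceeding max|x|; exact to divide by.
    s = np.float16(2.0 ** np.floor(np.log2(m))) if m > 0 else np.float16(1)
    y = x / s  # now |y| <= 2, so squares cannot overflow
    mu = y.mean(dtype=np.float16)
    var = np.mean((y - mu) ** 2, dtype=np.float16)
    return (y - mu) / np.sqrt(var + eps)
```

On an input like `[60000, -60000, 0]`, the naive version produces a degenerate all-zero output (the overflowed variance becomes infinite), while the pre-normalized version matches a float64 reference to within fp16 precision.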