As the compute capabilities and resources of today's devices grow, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, implementing on-device ASR remains challenging on resource-constrained devices such as smartphones, smart wearables, and other small home automation devices. In this paper, we propose a series of model architecture adaptations, neural network graph transformations, and numerical optimizations to fit an advanced Conformer-based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. We achieve speech recognition more than 5.26 times faster than real time (a real-time factor, RTF, of 0.19) on small wearables while minimizing energy consumption and achieving state-of-the-art accuracy. The proposed methods are widely applicable to other transformer-based server-free AI applications. In addition, we provide a complete theory of optimal pre-normalizers that numerically stabilize layer normalization in any Lp-norm using any floating-point precision.
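To make the pre-normalizer idea concrete, the following is a minimal sketch, not the paper's method: it emulates half-precision (fp16) arithmetic with the standard-library `struct` module and compares L2 layer normalization with and without a pre-scaling step. The helper names (`fp16`, `layer_norm_fp16`, `layer_norm_reference`, `max_abs_error`), the sample activations, and the choice of max-abs scaling as the pre-normalizer are all assumptions made for illustration; the optimal pre-normalizers derived in the paper may differ. The sketch relies only on the fact that layer normalization without an epsilon term is invariant to rescaling its input, so dividing by a positive constant first changes the numerics but not the mathematical result.

```python
import math
import struct


def fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (+/-inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def layer_norm_fp16(xs, pre_normalize):
    """L2 layer norm without affine parameters; every intermediate is rounded to fp16."""
    xs = [fp16(x) for x in xs]
    if pre_normalize:
        # Illustrative pre-normalizer (an assumption): divide by the largest magnitude.
        m = max(abs(x) for x in xs)
        if m > 0.0:
            xs = [fp16(x / m) for x in xs]
    n = len(xs)
    total = 0.0
    for x in xs:
        total = fp16(total + x)
    mean = fp16(total / n)
    centered = [fp16(x - mean) for x in xs]
    sq_sum = 0.0
    for c in centered:
        sq_sum = fp16(sq_sum + fp16(c * c))  # c * c is where large inputs overflow
    std = fp16(math.sqrt(fp16(sq_sum / n)))
    if std == 0.0:
        return [0.0 for _ in centered]  # constant input: nothing to normalize
    return [fp16(c / std) for c in centered]


def layer_norm_reference(xs):
    """Same normalization in Python's double precision, used as ground truth."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    std = math.sqrt(sum(c * c for c in centered) / n)
    return [c / std for c in centered]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical activations, chosen so that squaring exceeds the fp16 maximum (65504).
activations = [3000.0, -3000.0, 1000.0, -1000.0]
reference = layer_norm_reference(activations)
naive = layer_norm_fp16(activations, pre_normalize=False)
stabilized = layer_norm_fp16(activations, pre_normalize=True)
print("naive error:     ", max_abs_error(naive, reference))
print("stabilized error:", max_abs_error(stabilized, reference))
```

On this input the unscaled path overflows when squaring the centered values, so the variance becomes infinite and the normalized output is no longer meaningful, while the pre-scaled path keeps every intermediate within fp16 range. Max-abs scaling is only one simple choice; it is used here because it bounds every element by 1 before squaring.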