Large-scale universal speech models (USMs) are already used in production. However, serving cost grows with model size: because the serving cost of large models is dominated by model size, model size reduction is an important research topic. In this work we focus on reducing model size with weight-only quantization. We present weight binarization of the USM Recurrent Neural Network Transducer (RNN-T) and show that its model size can be reduced by 15.9x at the cost of a word error rate (WER) increase of only 1.9% compared with the float32 model. This makes the approach attractive for practical applications.
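To make the idea of weight-only binarization concrete, here is a minimal NumPy sketch. It is not the method used in this work, whose details the abstract does not give. It assumes a common sign-plus-scale scheme: each weight is stored as one bit (its sign), together with one float32 scale per output column, taken as the mean absolute value of that column. The function names `binarize_weights`, `dequantize`, and `compression_ratio` are hypothetical and chosen for illustration only.

```python
import numpy as np


def binarize_weights(w):
    """Binarize a float32 matrix of shape (in_dim, out_dim).

    Assumed scheme (not taken from the source): keep only the sign of
    each weight, plus one float32 scale per output column.
    """
    # Per-column scale: mean absolute value of the original weights.
    scale = np.mean(np.abs(w), axis=0).astype(np.float32)
    # True encodes +1, False encodes -1.
    bits = w >= 0
    # Pack 8 binary weights into each byte along the input axis.
    packed = np.packbits(bits, axis=0)
    return packed, scale


def dequantize(packed, scale, in_dim):
    """Rebuild a float32 approximation of the original matrix."""
    bits = np.unpackbits(packed, axis=0, count=in_dim)
    signs = bits.astype(np.float32) * 2.0 - 1.0  # {0, 1} -> {-1, +1}
    return signs * scale


def compression_ratio(w, packed, scale):
    """Original size in bytes divided by packed bits plus scales."""
    return w.nbytes / (packed.nbytes + scale.nbytes)


# Random weights stand in for a real layer; the shape is arbitrary.
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 512)).astype(np.float32)

packed, scale = binarize_weights(w)
w_hat = dequantize(packed, scale, w.shape[0])
ratio = compression_ratio(w, packed, scale)
print(f"compression ratio for this layer: {ratio:.1f}x")
```

For a single matrix, this scheme approaches the 32x ceiling of going from 32 bits to 1 bit per weight, with the per-column scales accounting for the small gap. The sketch is not meant to reproduce the 15.9x figure reported for the whole model, and it says nothing about WER, which depends on how the binarized model is trained.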