TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

Yuan Shangguan, Haichuan Yang, Danni Li, Chunyang Wu, Yassir Fathullah, Dilin Wang, Ayushi Dalmia, Raghuraman Krishnamoorthi, Ozlem Kalinli, Junteng Jia, Jay Mahadeokar, Xin Lei, Mike Seltzer, Vikas Chandra

From arXiv, Meta AI; submitted to ICASSP 2024

Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture, but re-training and re-validating models after each change is resource-intensive. This paper presents TODM (Train Once Deploy Many), a new approach that efficiently trains many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces the layer sizes and widths of the Supernet to obtain subnetworks, yielding smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained and individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech. Results demonstrate that our TODM Supernet either matches or surpasses the performance of manually tuned models, with up to a 3% relative improvement in word error rate (WER), while keeping the cost of training many models at a small constant.
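The weight-sharing idea in the abstract, where subnetworks are obtained by reducing the layer count and widths of one Supernet, can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a plain residual feed-forward block in place of the MH-SSM RNN-T layers, and the names SharedFFNBlock, Supernet, and count_params are hypothetical. Training details such as adaptive dropout, in-place distillation, and ScaledAdam are omitted.

```python
import random


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedFFNBlock:
    """Toy residual feed-forward block (hypothetical stand-in for a real layer).

    Narrower variants reuse the leading slice of the same weight matrices,
    so every subnetwork shares its weights with the full-size block.
    """

    def __init__(self, model_dim, max_hidden, rng):
        self.w_in = make_matrix(max_hidden, model_dim, rng)
        self.w_out = make_matrix(model_dim, max_hidden, rng)

    def forward(self, x, hidden):
        # Width reduction: use only the first `hidden` hidden units.
        h = [max(0.0, sum(w * v for w, v in zip(row, x)))
             for row in self.w_in[:hidden]]
        y = [sum(w * v for w, v in zip(row[:hidden], h)) for row in self.w_out]
        # Residual connection keeps the output dimension fixed.
        return [a + b for a, b in zip(x, y)]


class Supernet:
    """Stack of shared blocks; a config selects one subnetwork."""

    def __init__(self, num_layers=4, model_dim=8, max_hidden=16, seed=0):
        rng = random.Random(seed)
        self.model_dim = model_dim
        self.blocks = [SharedFFNBlock(model_dim, max_hidden, rng)
                       for _ in range(num_layers)]

    def forward(self, x, config):
        # config holds one hidden width per block; 0 drops the block entirely.
        for block, hidden in zip(self.blocks, config):
            if hidden > 0:
                x = block.forward(x, hidden)
        return x


def count_params(config, model_dim):
    """Number of weights a given subnetwork config actually uses."""
    return sum(2 * hidden * model_dim for hidden in config)


supernet = Supernet()
features = [1.0] * supernet.model_dim
full_output = supernet.forward(features, [16, 16, 16, 16])
small_output = supernet.forward(features, [8, 0, 8, 0])
```

In this sketch the full configuration uses 1024 weights and the reduced one uses 256, yet both read from the same underlying matrices, which is why one training run can serve many deployment sizes.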

