Automatic Speech Recognition (ASR) models must be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture, but re-training and re-validating models after each such change is resource-intensive. This paper presents TODM (Train Once Deploy Many), a new approach that efficiently trains many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to those of a single training job. TODM leverages insights from prior work on Supernet, where Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained and individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech. Results demonstrate that our TODM Supernet matches or surpasses the performance of manually tuned models, with up to a 3% relative improvement in word error rate (WER), while keeping the cost of training many models at a small constant.
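The idea of obtaining subnetworks by reducing the number and width of Supernet layers can be illustrated with a toy sketch. This is not the TODM implementation: the function names are hypothetical, plain feed-forward layers stand in for the MH-SSM RNN-T layers, and selecting the first layers and the leading units of each layer is an assumed selection rule chosen only for simplicity.

```python
import random


def make_supernet(num_layers, width, seed=0):
    """Build a toy supernet: a stack of square (width x width) weight matrices."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(width)]
        for _ in range(num_layers)
    ]


def extract_subnet(supernet, num_layers, width):
    """Keep the first `num_layers` layers and the top-left (width x width)
    block of each one (assumed selection rule, for illustration only).

    The slices here copy values. In a real framework they would be views on
    the shared parameters, so training a subnetwork would update the supernet.
    """
    if num_layers > len(supernet) or width > len(supernet[0]):
        raise ValueError("subnetwork must not exceed the supernet dimensions")
    return [[row[:width] for row in layer[:width]] for layer in supernet[:num_layers]]


def forward(layers, x):
    """Run a ReLU feed-forward pass through the given layers."""
    for weights in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return x


def count_params(layers):
    """Count the weights held by a stack of layers."""
    return sum(len(row) for layer in layers for row in layer)


supernet = make_supernet(num_layers=4, width=8)
small = extract_subnet(supernet, num_layers=2, width=4)
print(count_params(supernet), count_params(small))
print(forward(small, [1.0, 0.5, -0.5, 2.0]))
```

In this sketch one set of supernet weights yields models of several sizes without a separate training run per size, which is the property the abstract attributes to weight sharing.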