Continued improvements in machine learning techniques offer exciting new opportunities through the use of larger models and larger training datasets. However, there is a growing need to offer these new capabilities on low-powered devices such as smartphones, wearables, and other embedded environments where little memory is available. To this end, we consider methods to reduce the size of Conformer-based speech recognition models, which typically require more than 100M parameters, down to just $5$M parameters while minimizing the impact on model quality. Such a model allows us to achieve always-on ambient speech recognition on edge devices with low-memory neural processors. We propose reusing model weights at different levels within our model architecture: (i) repeating full Conformer block layers, (ii) sharing specific Conformer modules across layers, (iii) sharing sub-components per Conformer module, and (iv) sharing decomposed sub-component weights after low-rank decomposition. By sharing weights at different levels of our model, we can retain the full model in memory while increasing the number of virtual transformations applied to the input. Through a series of ablation studies and evaluations, we find that with weight sharing and a low-rank architecture, we can achieve a WER of 2.84 and 2.94 on LibriSpeech dev-clean and test-clean, respectively, with a $5$M parameter model.
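To make the parameter arithmetic behind weight sharing and low-rank decomposition concrete, the following is a minimal sketch in plain Python. It is not the architecture described above: the dimensions (`D_MODEL`, `RANK`, `NUM_LAYERS`, `NUM_UNIQUE`), the helper names, and the cyclic sharing pattern are all hypothetical choices made for illustration. It shows how a dense weight matrix can be replaced by two low-rank factors, and how reusing a small set of factors across many layers keeps the stored parameter count fixed while the number of transformations applied to the input grows.

```python
# Illustrative sketch only: all sizes and the cyclic sharing scheme are assumptions,
# not values or design choices taken from the abstract above.

def dense_params(d_in, d_out):
    """Parameters in a full d_out x d_in weight matrix."""
    return d_in * d_out


def low_rank_params(d_in, d_out, rank):
    """Parameters when W (d_out x d_in) is factored as U (d_out x rank) @ V (rank x d_in)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def apply_shared_stack(x, factors, num_layers):
    """Apply num_layers low-rank linear maps, cycling through the stored factor pairs.

    factors is a list of (U, V) pairs. Layer i reuses factors[i % len(factors)],
    so the stored weights stay fixed while the depth of computation grows.
    """
    for layer in range(num_layers):
        u, v = factors[layer % len(factors)]
        x = matvec(u, matvec(v, x))  # y = U (V x)
    return x


# Hypothetical sizes, chosen only to show the arithmetic.
D_MODEL = 256     # width of one square projection
RANK = 32         # rank of the decomposition
NUM_LAYERS = 16   # transformations applied to the input
NUM_UNIQUE = 4    # distinct weight sets actually stored

full_stack = dense_params(D_MODEL, D_MODEL) * NUM_LAYERS
low_rank_stack = low_rank_params(D_MODEL, D_MODEL, RANK) * NUM_LAYERS
shared_low_rank_stack = low_rank_params(D_MODEL, D_MODEL, RANK) * NUM_UNIQUE

if __name__ == "__main__":
    print("dense, unshared:      ", full_stack)
    print("low-rank, unshared:   ", low_rank_stack)
    print("low-rank, shared 4x:  ", shared_low_rank_stack)
```

In this toy setting the stored parameter count depends only on `NUM_UNIQUE` and `RANK`, not on `NUM_LAYERS`, which is the sense in which sharing increases the number of virtual transformations without increasing memory. The sketch says nothing about accuracy, which the abstract evaluates through ablation studies.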