The memory constraints of always-on devices are a major concern when deploying speech processing models on them. While larger models trained with a sufficiently large amount of data generally perform better, making them fit in device memory is a demanding challenge. In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure. More specifically, inspired by ResNet and the more recent LoRA work, we propose an approach named ResidualTransformer, in which each weight matrix in a Transformer layer comprises 1) a full-rank component shared with its adjacent layers, and 2) a low-rank component unique to that layer. The low-rank matrices account for only a small increase in model size. In addition, we add diagonal weight matrices to improve the modeling capacity of the low-rank matrices. Experiments on our 10k-hour speech recognition and speech translation tasks show that the Transformer encoder size can be reduced by about 3x with only slight performance degradation.
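The weight composition described in the abstract can be illustrated with a minimal sketch. It assumes each layer's weight takes the form W = W_shared + D + A·B, where W_shared is shared by a group of consecutive layers, A·B is the layer's own low-rank term, and D is the layer's own diagonal matrix. The function names, the grouping of layers by a fixed `group_size`, and the way the diagonal term is added are assumptions made for illustration; the abstract does not specify them.

```python
# Hypothetical sketch: compose_weight, encoder_params, group_size and rank are
# illustrative names and assumptions, not definitions taken from the paper.

def compose_weight(shared, a, b, diag):
    """Return W = shared + diag(diag) + a @ b as a list of rows.

    shared: n x m full-rank matrix shared by a group of adjacent layers
    a, b:   n x r and r x m low-rank factors unique to one layer
    diag:   up to min(n, m) diagonal entries unique to one layer
    """
    n, m, r = len(shared), len(shared[0]), len(b)
    # Shared component plus the layer-specific low-rank product.
    w = [[shared[i][j] + sum(a[i][k] * b[k][j] for k in range(r))
          for j in range(m)] for i in range(n)]
    # Layer-specific diagonal component.
    for i, d in enumerate(diag):
        w[i][i] += d
    return w


def encoder_params(num_layers, n, m, group_size=1, rank=0, use_diag=False):
    """Count parameters for one n x m weight matrix per layer.

    With group_size=1, rank=0 and use_diag=False this is the unshared baseline.
    """
    num_groups = -(-num_layers // group_size)  # ceiling division
    shared = num_groups * n * m
    per_layer = rank * (n + m) + (min(n, m) if use_diag else 0)
    return shared + num_layers * per_layer


if __name__ == "__main__":
    # Illustrative sizes only; these are not the paper's configuration.
    baseline = encoder_params(18, 512, 512)
    reduced = encoder_params(18, 512, 512, group_size=3, rank=16, use_diag=True)
    print(baseline, reduced, round(baseline / reduced, 2))
```

In this toy setting the shared full-rank matrices dominate the parameter count, so the compression ratio is governed mainly by how many layers share one matrix, while the low-rank and diagonal terms add comparatively little.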