In recent years, the Transformer network has attracted considerable attention for speech recognition tasks because of its excellent performance. However, the Transformer involves heavy computation and a large number of parameters, which causes serious deployment problems on devices with limited computational resources or storage memory. In this paper, a new lightweight model called Sim-T is proposed to expand the generality of the Transformer model. With the help of a newly developed multiplexing technique, Sim-T can efficiently compress the model with a negligible sacrifice in performance. More precisely, the proposed technique comprises two parts: module weight multiplexing and attention score multiplexing. Moreover, a novel decoder structure is proposed to facilitate attention score multiplexing. Extensive experiments have been conducted to validate the effectiveness of Sim-T. On the Aishell-1 dataset, Sim-T obtains a 0.4% CER improvement with 48% fewer parameters than the baseline Transformer. Alternatively, a 69% parameter reduction can be achieved when Sim-T matches the performance of the baseline Transformer. On the HKUST and WSJ eval92 datasets, CER and WER improve by 0.3% and 0.2%, respectively, when Sim-T has 40% fewer parameters than the baseline Transformer.
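The two multiplexing ideas named above can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the Sim-T architecture: the abstract does not specify which modules share weights or which layers reuse attention scores, so the single-head `AttentionBlock`, the helper names `run_stack` and `count_unique_params`, the 6-layer depth, and the grouping of layers are all hypothetical choices made for illustration. Weight multiplexing is modeled as one block object appearing at several layer positions, and attention score multiplexing as a layer reusing the score matrix computed by an earlier layer instead of recomputing it.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentionBlock:
    """Toy single-head self-attention block (hypothetical, for illustration)."""

    def __init__(self, d_model, rng):
        scale = 1.0 / np.sqrt(d_model)
        self.w_q = rng.standard_normal((d_model, d_model)) * scale
        self.w_k = rng.standard_normal((d_model, d_model)) * scale
        self.w_v = rng.standard_normal((d_model, d_model)) * scale

    def num_params(self):
        return self.w_q.size + self.w_k.size + self.w_v.size

    def forward(self, x, scores=None):
        # If scores are supplied, skip the query/key computation entirely.
        if scores is None:
            q, k = x @ self.w_q, x @ self.w_k
            scores = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return x + scores @ (x @ self.w_v), scores


def run_stack(blocks, x, reuse_scores_every=1):
    """Run the blocks in order, recomputing scores every `reuse_scores_every` layers."""
    scores = None
    for i, block in enumerate(blocks):
        if i % reuse_scores_every == 0:
            scores = None  # force a fresh score computation at this layer
        x, scores = block.forward(x, scores)
    return x


def count_unique_params(blocks):
    """Count parameters once per distinct block object, however often it is reused."""
    unique = {id(b): b for b in blocks}
    return sum(b.num_params() for b in unique.values())


rng = np.random.default_rng(0)
d_model, seq_len = 16, 5
x = rng.standard_normal((seq_len, d_model))

# Baseline: six layers, each with its own weights.
baseline = [AttentionBlock(d_model, rng) for _ in range(6)]

# Weight multiplexing (assumed grouping): two blocks, each used at three positions.
a, b = AttentionBlock(d_model, rng), AttentionBlock(d_model, rng)
shared = [a, a, a, b, b, b]

reduction = 1 - count_unique_params(shared) / count_unique_params(baseline)
print(f"parameter reduction in this toy setup: {reduction:.1%}")

# Score multiplexing (assumed schedule): every second layer reuses the previous scores.
y = run_stack(shared, x, reuse_scores_every=2)
print(y.shape)
```

The reduction printed here follows purely from the toy grouping and has no relation to the 48%, 69%, or 40% figures reported in the abstract, which depend on the actual Sim-T design.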