In recent years, end-to-end speech recognition based on deep learning has developed rapidly. However, because Turkish speech data are scarce, Turkish speech recognition systems still perform poorly. First, this paper studies a series of tuning techniques for speech recognition. The results show that the model performs best when data augmentation combining speed perturbation with noise addition is used and the beam search width is set to 16. Second, to make fuller use of effective feature information and improve the accuracy of feature extraction, this paper proposes a new feature extractor, LSPC. LSPC is combined with a LiGRU network to form a shared encoder structure, which also achieves model compression. The results show that, when only Fbank features are used, LSPC outperforms MSPC and VGGnet, reducing the word error rate (WER) by 1.01% and 2.53%, respectively. Finally, building on these two points, a new multi-feature fusion network is proposed as the main structure of the encoder. The results show that the proposed LSPC-based feature fusion network reduces the WER by a further 0.82% and 1.94% compared with single-feature extraction using LSPC (Fbank features and spectrogram features, respectively). Our model achieves performance comparable to that of advanced end-to-end models.
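Since every result above is reported as a change in WER, a minimal sketch of how the metric is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the Turkish example sentence are illustrative assumptions and are not taken from the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace; no text normalization
    (casing, punctuation) is applied here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the hypothesis.
    print(word_error_rate("bugün hava çok güzel", "bugün hava güzel"))
```

Because insertions are counted against the reference length, the value can exceed 1.0 when the hypothesis is much longer than the reference.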