Neural speaker embeddings encode a speaker's speech characteristics with a DNN model and are widely used for speaker verification. However, few studies have investigated their use in ASR systems. In this work, we present our efforts to integrate neural speaker embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline, combined with the Weighted-Simple-Add integration method, brings x-vectors and c-vectors on par with i-vectors. We further compare and analyze the different speaker embeddings. We also present acoustic model improvements obtained by switching from the Newbob learning rate schedule to the one-cycle learning rate schedule, which yields a ~3% relative WER reduction on Switchboard and reduces the overall training time by 17%. Adding neural speaker embeddings gives a further ~3% relative WER improvement on Hub5'00. Our best Conformer-based hybrid ASR system with speaker embeddings, trained on SWB 300h, achieves 9.0% WER on Hub5'00 and Hub5'01.
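The abstract does not spell out the shape of the one-cycle schedule, so the following is only a minimal sketch of the commonly described form: a linear ramp up to a peak learning rate, a symmetric linear ramp back down, and a short final decay. The function name `one_cycle_lr`, its parameters, and every default value are hypothetical and are not taken from this work.

```python
def one_cycle_lr(step, half_cycle_steps, total_steps,
                 lr_initial=1e-5, lr_peak=1e-3, lr_final=1e-6):
    """Return the learning rate for a given training step.

    Assumed shape (illustrative, not the authors' exact configuration):
      1. linear increase from lr_initial to lr_peak,
      2. linear decrease from lr_peak back to lr_initial,
      3. linear decay from lr_initial towards lr_final for the remaining steps.
    """
    if half_cycle_steps <= 0 or total_steps <= 2 * half_cycle_steps:
        raise ValueError("need 0 < 2 * half_cycle_steps < total_steps")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")

    if step < half_cycle_steps:
        # Phase 1: ramp up to the peak.
        frac = step / half_cycle_steps
        return lr_initial + frac * (lr_peak - lr_initial)

    if step < 2 * half_cycle_steps:
        # Phase 2: ramp back down to the initial rate.
        frac = (step - half_cycle_steps) / half_cycle_steps
        return lr_peak + frac * (lr_initial - lr_peak)

    # Phase 3: final decay below the initial rate.
    tail_steps = total_steps - 2 * half_cycle_steps
    frac = (step - 2 * half_cycle_steps) / tail_steps
    return lr_initial + frac * (lr_final - lr_initial)
```

With, for example, `half_cycle_steps=45` and `total_steps=100`, the rate rises for the first 45 steps, falls for the next 45, and decays further over the last 10. Unlike Newbob, which lowers the rate in response to validation scores, a schedule of this kind is fixed in advance by the step count alone.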