We analyze the impact of speaker adaptation on end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable way to increase the robustness of end-to-end architectures. The effect on transformer models grows stronger as more noise is added to the input speech, whereas systems based on wav2vec 2.0 benefit most under moderate or no noise. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The optimal embedding size depends on the dataset and also varies with the noise condition.
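As a rough illustration of the concatenation approach described above, the following sketch tiles one utterance-level speaker vector across time and appends it to every acoustic frame, and also shows how a relative word error rate improvement is computed. It is not the authors' implementation: the function names `append_speaker_vector` and `relative_wer_improvement`, the 80-dimensional features, and the 16-dimensional speaker vector are assumptions chosen only for the example.

```python
import numpy as np


def append_speaker_vector(features, speaker_vector):
    """Concatenate one utterance-level speaker vector to every frame.

    features:       array of shape (num_frames, feature_dim)
    speaker_vector: array of shape (embedding_dim,)
    returns:        array of shape (num_frames, feature_dim + embedding_dim)
    """
    features = np.asarray(features, dtype=np.float32)
    speaker_vector = np.asarray(speaker_vector, dtype=np.float32)
    if features.ndim != 2 or speaker_vector.ndim != 1:
        raise ValueError("expected 2-D features and a 1-D speaker vector")
    # Repeat the same speaker vector for each time step (no copy until concatenation).
    tiled = np.broadcast_to(
        speaker_vector, (features.shape[0], speaker_vector.shape[0])
    )
    return np.concatenate([features, tiled], axis=1)


def relative_wer_improvement(baseline_wer, adapted_wer):
    """Relative WER improvement in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical shapes: 5 frames of 80-dim features, one 16-dim speaker vector.
frames = np.zeros((5, 80))
speaker = np.ones(16)
augmented = append_speaker_vector(frames, speaker)
```

Here `augmented` has one row per frame, with the speaker vector occupying the last 16 columns of each row. In practice the speaker vector would come from a trained i-vector, x-vector, or ECAPA-TDNN extractor rather than a constant array.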