Despite recent improvements in End-to-End Automatic Speech Recognition (E2E ASR) systems, performance can still degrade due to mismatched vocal characteristics between training and test data, particularly when target-speaker adaptation data is limited. We propose a novel speaker adaptation approach, Speaker-Smoothed kNN, which leverages k-Nearest Neighbors (kNN) retrieval to improve model output by finding correctly pronounced tokens in a pre-built datastore during decoding. Moreover, we use x-vectors to dynamically adjust the kNN interpolation parameters, mitigating the data sparsity issue. We validate this approach on the KeSpeech and MagicData corpora under both in-domain and all-domain settings. Our method consistently performs comparably to fine-tuning without the performance degradation that fine-tuning exhibits when the speaker changes. Furthermore, in the all-domain setting, our method achieves state-of-the-art results, reducing the character error rate (CER) in both single-speaker and multi-speaker test scenarios.
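The core mechanism described above is kNN-augmented decoding: the model's token distribution is interpolated with a distribution derived from datastore retrieval, with the interpolation weight adjusted per utterance from speaker embeddings. The sketch below illustrates one plausible form of this interpolation; the function names, the cosine-similarity weighting scheme, and the `lam_max` cap are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def knn_interpolate(model_probs: np.ndarray, knn_probs: np.ndarray,
                    lam: float) -> np.ndarray:
    """kNN-LM-style interpolation of two token distributions.

    Both inputs are probability vectors over the vocabulary; the output
    is a valid distribution because it is a convex combination.
    """
    return lam * knn_probs + (1.0 - lam) * model_probs

def dynamic_lambda(xvec: np.ndarray, centroid: np.ndarray,
                   lam_max: float = 0.5) -> float:
    """Hypothetical dynamic weighting: scale the kNN weight by the cosine
    similarity between the utterance x-vector and a datastore speaker
    centroid, so sparse or mismatched speakers rely less on retrieval.
    """
    sim = float(np.dot(xvec, centroid) /
                (np.linalg.norm(xvec) * np.linalg.norm(centroid)))
    return lam_max * max(sim, 0.0)

# Example: a token the model under-weights but the datastore supports.
model = np.array([0.7, 0.2, 0.1])
knn = np.array([0.1, 0.8, 0.1])
lam = dynamic_lambda(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
mixed = knn_interpolate(model, knn, lam)
```

With an x-vector identical to the centroid, `lam` reaches `lam_max`, and the retrieved distribution boosts the correctly pronounced token; a dissimilar x-vector would push `lam` toward zero, falling back to the base model.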