Automatic speech recognition (ASR) training can use multiple experts as teacher models, each trained on a specific domain or accent. The teacher models may be opaque: their architecture may not be known, or their training cadence may differ from that of the student ASR model. Even so, the student model can be updated incrementally using the pseudo-labels that the expert teachers generate independently. In this paper, we exploit supervision from multiple domain experts to train student ASR models. This training strategy is especially useful when few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio and then trains the student model in an unsupervised setting. We demonstrate the efficacy of our approach on the LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
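The expert-selection step can be illustrated with a small sketch. The abstract does not say how the Smart-Weighter is parameterized or trained, so everything here is an assumption made for illustration: the single linear gate (`gate_rows`), the fixed-size utterance embedding, the softmax over experts, and the names `gate_weights` and `select_pseudo_label` are all hypothetical. The sketch only shows the general idea of scoring each opaque expert from the input audio and passing the chosen expert's pseudo-label to the student.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(embedding, gate_rows):
    """Hypothetical gate: one linear score per expert, computed from an
    utterance-level audio embedding, then normalized to weights."""
    scores = [sum(w * x for w, x in zip(row, embedding)) for row in gate_rows]
    return softmax(scores)

def select_pseudo_label(embedding, gate_rows, expert_hypotheses):
    """Pick the transcript of the highest-weighted expert.
    Experts are treated as black boxes: only their outputs are used."""
    weights = gate_weights(embedding, gate_rows)
    best = max(range(len(weights)), key=lambda i: weights[i])
    return expert_hypotheses[best], weights

# Toy setup (all values are made up): 3 experts, 2-dimensional embedding.
gate_rows = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
embedding = [0.1, 1.5]
hypotheses = ["transcript from expert A",
              "transcript from expert B",
              "transcript from expert C"]

label, weights = select_pseudo_label(embedding, gate_rows, hypotheses)
# `label` would then serve as the training target for the student model.
```

Replacing `weights` with equal values for every expert corresponds to the uniform-weighting baseline mentioned above, and always returning the same index corresponds to the single-expert baseline.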