Rich sources of variability in natural speech present significant challenges to current data-intensive speech recognition technologies. To model both speaker- and environment-level diversity, this paper proposes a novel Bayesian factorised speaker-environment adaptive training and test-time adaptation approach for Conformer ASR models. Speaker- and environment-level characteristics are modeled separately using compact hidden output transforms, which are then linearly or hierarchically combined to represent any speaker-environment combination. Bayesian learning is further used to model the uncertainty of the adaptation parameters. Experiments on the 300-hour Switchboard data corrupted with WHAM noise suggest that factorised adaptation consistently outperforms both the baseline and the speaker-label-only adapted Conformers, with word error rate reductions of up to 3.1% absolute (10.4% relative). Further analysis shows that the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions.
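The two ways of combining speaker and environment transforms can be illustrated with a minimal sketch. The abstract does not specify the form of the hidden output transforms, so this sketch assumes an element-wise scaling of a hidden vector, with a scale of 2·sigmoid(r) and a Gaussian distribution over the parameters r. All function and variable names are hypothetical, and the code is not the paper's implementation.

```python
import math
import random


def activation(r):
    """Map an unconstrained parameter to a scale in (0, 2) (assumed form)."""
    return 2.0 / (1.0 + math.exp(-r))


def combine_linear(h, r_spk, r_env):
    """Linear combination (assumed): sum speaker and environment
    parameters, then scale the hidden output once."""
    return [x * activation(s + e) for x, s, e in zip(h, r_spk, r_env)]


def combine_hierarchical(h, r_spk, r_env):
    """Hierarchical combination (assumed): apply the environment scale
    first, then the speaker scale on top of it."""
    env_out = [x * activation(e) for x, e in zip(h, r_env)]
    return [x * activation(s) for x, s in zip(env_out, r_spk)]


def sample_params(mean, log_std, rng):
    """Bayesian treatment (assumed Gaussian): draw r = mean + std * eps,
    with eps ~ N(0, 1), independently for each dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mean, log_std)]


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = [1.0, -2.0, 0.5]  # toy hidden-layer output
    r_spk = sample_params([0.2, 0.0, -0.1], [-2.0, -2.0, -2.0], rng)
    r_env = sample_params([-0.3, 0.1, 0.0], [-2.0, -2.0, -2.0], rng)
    print(combine_linear(hidden, r_spk, r_env))
    print(combine_hierarchical(hidden, r_spk, r_env))
```

When all parameters are zero, both combinations reduce to the identity, because the scale is 2·sigmoid(0) = 1. Under these assumptions, an unseen speaker-environment pair can be represented by reusing a speaker vector and an environment vector that were each estimated separately.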