Automatic speech recognition (ASR) can be improved by multi-task learning (MTL) with domain enhancing or domain adversarial training. These two objectives are opposite: enhancing increases domain variance to make ASR domain-aware, whereas adversarial training decreases it to make ASR domain-agnostic. In this work, we study how best to apply these two opposite objectives with speaker labels to improve conformer-based ASR. We also propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort. Detailed analysis and experimental verification show the optimal positions in the ASR neural network (NN) at which to apply speaker enhancing and adversarial training. We also explore their combination for further improvement, achieving the same performance as i-vectors plus adversarial training. Our best speaker-based MTL achieves a 7% relative improvement on the Switchboard Hub5'00 set. We also investigate the effect of such speaker-based MTL with respect to a cleaner dataset and a weaker ASR NN.
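For readers unfamiliar with gradient reversal, the following is a minimal sketch of the standard fixed-scale mechanism that adversarial training of this kind commonly builds on. It is not the adaptive layer proposed in this work, whose details are not given here. The class name `GradientReversal`, the scale `lam`, the helper `encoder_gradient`, and all numeric values are hypothetical and chosen only for illustration.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass.

    Standard fixed-scale variant (assumption), not the adaptive layer of the paper.
    """

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # fixed reversal scale (hypothetical parameter name)

    def forward(self, x):
        # Features pass through to the speaker classifier unchanged.
        return list(x)

    def backward(self, grad_output):
        # The speaker-loss gradient is negated and scaled before it
        # reaches the shared encoder.
        return [-self.lam * g for g in grad_output]


def encoder_gradient(grad_asr, grad_speaker, grl):
    """Sum the ASR gradient and the reversed speaker gradient per feature."""
    reversed_speaker = grl.backward(grad_speaker)
    return [a + s for a, s in zip(grad_asr, reversed_speaker)]


# Illustrative values only: three encoder features with made-up gradients.
grl = GradientReversal(lam=0.5)
features = [1.0, -2.0, 3.0]
speaker_input = grl.forward(features)
combined = encoder_gradient([0.2, 0.0, -0.4], [1.0, -2.0, 0.0], grl)
print(combined)
```

With reversal, the shared encoder is updated to make speaker classification harder while the ASR gradient is left unchanged. Dropping the reversal, so that the speaker gradient flows back with its original sign, would correspond to the speaker enhancing objective.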