Synthetic data is widely used in speech recognition because text-to-speech models are readily available, which makes it easier to adapt models to previously unseen text domains. However, existing methods lose performance when they fine-tune an automatic speech recognition (ASR) model on synthetic data, because of the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task vector arithmetic is effective at mitigating this gap. Our proposed method, the SYN2REAL task vector, yields an average improvement of 10.03\% in word error rate over baselines on the SLURP dataset. Additionally, we show that averaging SYN2REAL task vectors, when real speech from multiple different domains is available, can further adapt the original ASR model to perform better on the target text domain.
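To make the task vector arithmetic concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact formula, so the sketch assumes a SYN2REAL task vector is the parameter-wise difference between a model fine-tuned on real speech and one fine-tuned on synthetic speech from the same source domain, and that it is added, with a scaling factor, to a model fine-tuned on synthetic speech from the target domain. The names `Params`, `task_vector`, `average_vectors`, `apply_vector`, and `scale` are hypothetical, and the toy weights stand in for real ASR checkpoints.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Parameter-wise difference (finetuned - base)."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def average_vectors(vectors: List[Params]) -> Params:
    """Parameter-wise mean of several task vectors."""
    n = len(vectors)
    return {
        k: [sum(vals) / n for vals in zip(*(v[k] for v in vectors))]
        for k in vectors[0]
    }


def apply_vector(model: Params, vector: Params, scale: float = 1.0) -> Params:
    """Add a scaled task vector to a model's parameters."""
    return {k: [m + scale * v for m, v in zip(model[k], vector[k])] for k in model}


# Toy weights (assumed): two source domains, each with a model fine-tuned on
# synthetic speech and one fine-tuned on real speech.
syn_a, real_a = {"w": [1.0, 2.0]}, {"w": [1.5, 1.0]}
syn_b, real_b = {"w": [0.0, 4.0]}, {"w": [1.5, 3.0]}

# One SYN2REAL-style vector per source domain (assumed definition: real - synthetic).
vec_a = task_vector(real_a, syn_a)
vec_b = task_vector(real_b, syn_b)
avg_vec = average_vectors([vec_a, vec_b])

# Model fine-tuned on synthetic speech for the target text domain (toy values).
target_syn = {"w": [2.0, 2.0]}
adapted = apply_vector(target_syn, avg_vec, scale=0.5)
print(adapted)
```

In practice the dictionaries would be full model state dicts and `scale` would be tuned on held-out data; both choices are assumptions of this sketch rather than details stated in the abstract.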