Pretrained automatic speech recognition (ASR) models such as Whisper perform well but still need domain adaptation to handle unseen parlance. In many real-world settings, collecting speech data is impractical, which makes text-only adaptation necessary. We propose WhisTLE, a deeply supervised, text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE trains a variational autoencoder (VAE) to model encoder outputs from text and fine-tunes the decoder using the learned text-to-latent encoder, optionally combined with text-to-speech (TTS) adaptation. At inference, the original encoder is restored, so the method incurs no extra runtime cost. Across four datasets and four ASR models, WhisTLE with TTS achieves a 49.0% relative reduction in word error rate (WER) and outperforms all non-WhisTLE baselines in 100 of 112 scenarios. We also find that WhisTLE additively complements any combination of other domain adaptation approaches; we therefore recommend including WhisTLE in standard pipelines for adapting encoder-decoder ASR models.
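For readers less familiar with the metric, the sketch below shows how a word error rate and a relative WER reduction are typically computed. It is an illustration of the arithmetic only, not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction`, the example transcripts, and the WER values 0.200 and 0.102 are all hypothetical, and the abstract does not state which baseline its relative figure is measured against.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical transcripts (not taken from the paper's datasets).
example_wer = word_error_rate("a b c d", "a x c")

# Hypothetical WER values chosen only to illustrate the arithmetic.
example_gain = relative_reduction(baseline_wer=0.200, adapted_wer=0.102)
print(f"WER: {example_wer:.3f}, relative reduction: {example_gain:.1%}")
```

A relative reduction is scaled by the baseline, so the same percentage can correspond to very different absolute WER changes depending on how high the starting error rate was.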