We investigate the emergent abilities of the recently proposed web-scale speech model Whisper by adapting it to unseen tasks with prompt engineering. We select three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts either by leveraging another large-scale model or by simply manipulating the special tokens in the default prompts. Experiments show that, compared to the default prompts, our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks and even outperform SotA supervised models on some datasets. In addition, our experiments reveal many interesting properties of Whisper, including its robustness to prompts, its bias on accents, and the multilingual understanding in its latent space. Code is available at https://github.com/jasonppy/PromptingWhisper
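As a rough illustration of what "manipulating the special tokens in the default prompts" could look like, the sketch below assembles Whisper-style prompt strings in plain Python. The token strings follow Whisper's published special-token format; the helper `build_prompt` is hypothetical, and the particular manipulation shown (placing two language tokens side by side for code-switched speech) is an assumption made for illustration, not a description confirmed by the abstract.

```python
# Minimal sketch: build Whisper-style decoder prompts as strings.
# `build_prompt` is a hypothetical helper, not part of any Whisper API.

SOT = "<|startoftranscript|>"
VALID_TASKS = ("transcribe", "translate")


def build_prompt(languages, task="transcribe", timestamps=False):
    """Return a prompt string made of special tokens.

    languages: list of language codes, e.g. ["en"] or ["zh", "en"].
    task: "transcribe" or "translate".
    timestamps: if False, append the no-timestamps token.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task: {task!r}")

    tokens = [SOT]
    tokens += [f"<|{lang}|>" for lang in languages]  # one token per language
    tokens.append(f"<|{task}|>")
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)


# Default-style prompt for English transcription.
default_prompt = build_prompt(["en"])

# Assumed manipulation for code-switched speech: two language tokens.
code_switch_prompt = build_prompt(["zh", "en"])

print(default_prompt)
print(code_switch_prompt)
```

In an actual system these strings would be mapped to token ids by the model's tokenizer before decoding; that step is omitted here because it depends on the specific library version in use.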