We investigate the emergent abilities of the recently proposed web-scale speech model Whisper by adapting it to unseen tasks through prompt engineering. We select three tasks: audio-visual speech recognition (AVSR), code-switched speech recognition (CS-ASR), and speech translation (ST) on unseen language pairs. We design task-specific prompts either by leveraging another large-scale model or by simply manipulating the special tokens in the default prompts. Experiments show that, compared to the default prompts, our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks and even outperform state-of-the-art (SotA) supervised models on some datasets. In addition, our experiments reveal many interesting properties of Whisper, including its robustness to prompts, its accent bias, and the multilingual understanding in its latent space. Code is available at https://github.com/jasonppy/PromptingWhisper
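As a rough illustration of what "manipulating the special tokens in the default prompts" might look like, the sketch below assembles Whisper-style prompt strings in plain Python. The special-token strings follow Whisper's publicly documented format. The helper name `build_prompt` and the idea of concatenating two language tokens for code-switched speech are assumptions made here for illustration, not necessarily the exact prompts used in this work. The linked repository is the authoritative source.

```python
# Minimal sketch: builds prompt strings only; it does not load or call Whisper.
# `build_prompt` is a hypothetical helper introduced for illustration.

SOT = "<|startoftranscript|>"
NO_TIMESTAMPS = "<|notimestamps|>"
VALID_TASKS = ("transcribe", "translate")


def build_prompt(languages, task="transcribe", timestamps=False):
    """Return a Whisper-style prompt string.

    languages: list of language codes, e.g. ["en"] or ["zh", "en"].
    task: "transcribe" or "translate".
    timestamps: if False, append the no-timestamps token.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}")

    # One special token per language code, in the order given.
    language_tokens = "".join(f"<|{code}|>" for code in languages)
    prompt = f"{SOT}{language_tokens}<|{task}|>"
    if not timestamps:
        prompt += NO_TIMESTAMPS
    return prompt


# Default-style prompt for English transcription.
default_prompt = build_prompt(["en"])

# Assumed variant for code-switched speech: two language tokens in a row.
code_switch_prompt = build_prompt(["zh", "en"])

print(default_prompt)
print(code_switch_prompt)
```

Keeping the prompt as an explicit token sequence makes it easy to swap a single token (language or task) and compare the result against the default prompt on the same audio.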