This paper investigates the in-context learning abilities of the Whisper automatic speech recognition (ASR) models released by OpenAI. A novel speech-based in-context learning (SICL) approach is proposed for test-time adaptation, which reduces word error rates (WERs) using only a small number of labelled speech samples and no gradient descent. Language-level adaptation experiments on Chinese dialects show that applying SICL to isolated-word ASR yields consistent and considerable relative WER reductions on two dialects with Whisper models of any size, averaging 32.3%. A k-nearest-neighbours-based in-context example selection technique can further improve the efficiency of SICL, raising the average relative WER reduction to 36.4%. The findings are verified on speaker adaptation and continuous speech recognition tasks, both of which also show considerable relative WER reductions. Detailed quantitative analyses are also provided to shed light on SICL's adaptability to phonological variances and dialect-specific lexical nuances.
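The k-nearest-neighbours-based example selection and the relative WER reduction metric mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the embedding source, the distance measure, or the value of k, so the cosine similarity over fixed-length utterance embeddings used here is an assumption, and the function names `select_in_context_examples` and `relative_wer_reduction` are hypothetical rather than taken from the paper.

```python
import numpy as np


def select_in_context_examples(query_emb, pool_embs, k=4):
    """Return indices of the k pool utterances most similar to the query.

    Assumption: each utterance is represented by one fixed-length embedding
    vector, and similarity is measured with cosine similarity.
    """
    query = query_emb / np.linalg.norm(query_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = pool @ query  # cosine similarity of every pool item to the query
    # argsort is ascending, so take the last k and reverse (nearest first)
    return np.argsort(sims)[-k:][::-1]


def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent, given baseline and adapted WERs."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer
```

In such a setup, the labelled speech samples at the selected indices would be supplied to the model as in-context examples ahead of the test utterance, in place of randomly chosen examples.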