Low-resource automatic speech recognition (ASR) remains challenging, especially for languages such as Arabic that exhibit wide dialectal variation and have limited labeled data. We propose context-aware prompting strategies that adapt OpenAI's Whisper to Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing with speech synthesized in the target speaker's voice. We introduce prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces word error rate (WER) by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, and significantly mitigates hallucinations and speaker mismatch.
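To make the decoder-prompting idea concrete, the following is a minimal sketch of a two-pass pipeline with lexical retrieval. It is an illustration under stated assumptions, not the authors' implementation: the function names (`retrieve_lexical`, `build_prompt`, `two_pass_transcribe`), the character n-gram cosine similarity, and the specific reordering rule (least similar context first, first-pass transcription last) are all hypothetical choices made for this example. The `transcribe` argument is a placeholder for any callable that takes audio plus an optional text prompt and returns a transcription.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Optional, Sequence


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a whitespace-padded string."""
    padded = f" {text.strip()} "
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_lexical(query: str, pool: Sequence[str], k: int = 2) -> List[str]:
    """Return the k pool utterances most lexically similar to the query."""
    query_vec = char_ngrams(query)
    ranked = sorted(pool, key=lambda utt: cosine(query_vec, char_ngrams(utt)), reverse=True)
    return ranked[:k]


def build_prompt(retrieved: Sequence[str], first_pass: str) -> str:
    """Assemble the decoder prompt.

    Assumption for this sketch: the least similar retrieved utterance goes
    first and the first-pass transcription goes last, so the most relevant
    context sits closest to the decoding position.
    """
    return " ".join(list(reversed(retrieved)) + [first_pass])


def two_pass_transcribe(
    audio,
    transcribe: Callable[..., str],  # hypothetical: transcribe(audio, prompt=...) -> str
    pool: Sequence[str],
    k: int = 2,
) -> str:
    """Run an unprompted first pass, then a second pass with a retrieved prompt."""
    first_pass = transcribe(audio, prompt=None)
    prompt: Optional[str] = build_prompt(retrieve_lexical(first_pass, pool, k), first_pass)
    return transcribe(audio, prompt=prompt)
```

With the open-source `openai-whisper` package, one plausible way to supply `transcribe` is a thin wrapper around `model.transcribe(audio, language="ar", initial_prompt=prompt)["text"]`. Semantic or acoustic retrieval would replace `retrieve_lexical` with a similarity over text or audio embeddings, which this sketch does not cover.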