Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite recent advances in automatic speech recognition (ASR), CS-ASR remains a challenging task owing to the grammatical complexity of the phenomenon and the scarcity of dedicated training corpora. In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by ASR systems to address the CS problem. Specifically, we first employ multiple well-trained ASR models for N-best hypotheses generation, with the aim of increasing the diversity and informativeness of the hypothesis set. Next, we use the LLMs to learn the hypotheses-to-transcription (H2T) mapping by adding a trainable low-rank adapter. This generative error correction (GER) method directly predicts the accurate transcription from the N-best hypotheses, drawing on the LLM's expert linguistic knowledge, which marks a paradigm shift from traditional language model rescoring or error correction techniques. Experimental evidence demonstrates that GER significantly enhances CS-ASR accuracy, as measured by a reduced mixed error rate (MER). Furthermore, LLMs show remarkable data efficiency for H2T learning, providing a potential solution to the data scarcity problem of CS-ASR in low-resource languages.
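To make the evaluation metric concrete, the following is a minimal sketch of how a mixed error rate could be computed. It assumes a Mandarin-English language pair and the common convention of scoring Mandarin per character and English per word; neither the language pair nor the exact scoring rules are stated in the abstract, so these are illustrative assumptions. The function names (`mixed_tokens`, `edit_distance`, `mixed_error_rate`) are hypothetical, and text normalization such as punctuation stripping is omitted.

```python
import re

# Assumption: CJK Unified Ideographs are scored one character at a time,
# while any other run of non-space characters is scored as a single word.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\s\u4e00-\u9fff]+")


def mixed_tokens(text):
    """Split a code-switched string into per-character CJK and per-word other tokens."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = mixed_tokens(ref)
        errors += edit_distance(ref_tokens, mixed_tokens(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total
```

For instance, scoring the hypothesis `我想要 a cap of coffee` against the reference `我想要 a cup of coffee` yields seven reference tokens (three characters and four words) and one substitution, so the sketch returns 1/7.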